# SGD (Schema-Guided Dialogue) Dataset

- Paper: [Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset](https://arxiv.org/abs/1909.05855) (AAAI 2020)
- Brief Explanation: https://paperswithcode.com/dataset/sgd
- Official Repository: https://github.com/google-research-datasets/dstc8-schema-guided-dialogue

The Schema-Guided Dialogue (SGD) dataset consists of over **20k annotated multi-domain, task-oriented conversations** between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning **20 domains, ranging from banks and events to media, calendar, travel, and weather**. For most of these domains, the dataset contains multiple APIs, many of which have overlapping functionality but different interfaces, reflecting common real-world scenarios. The wide range of annotations can be used for intent prediction, slot filling, dialogue state tracking, policy imitation learning, language generation, and user simulation learning, among other tasks for large-scale virtual assistants. In addition, the **evaluation set contains unseen domains and services**, which makes it possible to quantify performance in **zero-shot or few-shot** settings.

In [51]:
# %load_ext lab_black

In [30]:
import os
import json
import glob
from pprint import pprint

In [3]:
trn_pth = "./train"
dev_pth = "./dev"
tst_pth = "./test"

In [21]:
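# Load every dialogues_*.json file of the train split.
# Note: glob.glob returns paths in arbitrary (OS-dependent) order,
# so list indices below do not necessarily follow the file numbering.
trn_dialogues = []
for dialog in glob.glob(os.path.join(trn_pth, "dialogues_*.json")):
    with open(dialog, encoding="utf-8") as f:
        json_object = json.load(f)
    trn_dialogues.append(json_object)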

In [22]:
# Total number of JSON files (train)
len(trn_dialogues)

127

In [34]:
# Number of dialogues in each json file (train)
pprint(" ".join([str(len(dlog)) for dlog in trn_dialogues]))

('128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 '
 '128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 '
 '128 128 128 128 128 128 128 128 127 126 128 128 128 115 128 128 128 128 128 '
 '128 128 128 128 128 128 128 62 128 128 128 126 128 128 128 128 128 128 128 '
 '128 128 127 128 128 128 128 128 128 128 128 128 103 128 128 128 128 128 128 '
 '128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 '
 '128 128 128 128 128 128 128 127 128 125 128 128 128')


In [50]:
# Total number of dialogues
def count_data(root):
    total = 0
    for dlog in glob.glob(os.path.join(root, "dialogues_*.json")):
        with open(dlog) as f:
            json_object = json.load(f)
        total += len(json_object)
    return total


pprint(f"Total number of dialogues in trn data is {count_data(trn_pth)}")
pprint(f"Total number of dialogues in dev data is {count_data(dev_pth)}")
pprint(f"Total number of dialogues in tst data is {count_data(tst_pth)}")

'Total number of dialogues in trn data is 16142'
'Total number of dialogues in dev data is 2482'
'Total number of dialogues in tst data is 4201'
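
As a quick sanity check on the "over 20k" figure, the three counts printed above can be summed and turned into split proportions. This is only a small arithmetic sketch; the dictionary name `split_counts` is introduced here for illustration, and the numbers are copied from the output above.

```python
# Dialogue counts per split, copied from the output printed above.
split_counts = {"train": 16142, "dev": 2482, "test": 4201}

# Overall size of the dataset.
total = sum(split_counts.values())

# Fraction of all dialogues that falls into each split.
shares = {name: count / total for name, count in split_counts.items()}

print(f"total dialogues: {total}")  # total dialogues: 22825
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total of 22,825 dialogues is consistent with the "over 20k" description, with roughly seven in ten dialogues in the train split.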


### Structure
```sh
dstc8-schema-guided-dialogue/
├─ train/
│   ├─ dialogues_001.json
│   │   ├─ dialog_0 (keys: dialogue_id, services, turns)
│   │   ├─ ...
│   │   └─ dialog_k (keys: dialogue_id, services, turns)
│   ├─ ...
│   ├─ dialogues_127.json
│   └─ schema.json
├─ dev/
└─ test/
```
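
Given this layout, one possible way to load a split so that each dialogue can be traced back to its file is sketched below. The helper names `load_split` and `iter_dialogues` are hypothetical (they are not part of the official repository), and the sketch assumes each `dialogues_*.json` file holds a JSON list of dialogues, as the structure above suggests. Sorting the paths makes the order reproducible, since `glob.glob` does not guarantee any ordering.

```python
import glob
import json
import os


def load_split(root):
    """Return {file name: list of dialogues} for one split, sorted by file name."""
    dialogues_by_file = {}
    # sorted() gives a reproducible order; schema.json is excluded by the pattern.
    for path in sorted(glob.glob(os.path.join(root, "dialogues_*.json"))):
        with open(path, encoding="utf-8") as f:
            dialogues_by_file[os.path.basename(path)] = json.load(f)
    return dialogues_by_file


def iter_dialogues(dialogues_by_file):
    """Yield (file name, dialogue) pairs across all files of a split."""
    for name, dialogues in dialogues_by_file.items():
        for dialogue in dialogues:
            yield name, dialogue


# Only runs if the train split is present at the path used in this notebook.
if os.path.isdir("./train"):
    train = load_split("./train")
    print(len(train), "files,", sum(len(d) for d in train.values()), "dialogues")
```

Keying by file name rather than list position avoids relying on the order in which the files happened to be read.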

In [47]:
sample = trn_dialogues[10][0]

# hierarchical dictionary keys
pprint(sample.keys())
pprint(sample["turns"][0].keys())
pprint(sample["turns"][0]["frames"][0].keys())
pprint(sample["turns"][0]["frames"][0]["actions"][0].keys())

dict_keys(['dialogue_id', 'services', 'turns'])
dict_keys(['frames', 'speaker', 'utterance'])
dict_keys(['actions', 'service', 'slots', 'state'])
dict_keys(['act', 'canonical_values', 'slot', 'values'])
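
The nesting shown by these keys (dialogue → turns → frames → actions) can be walked with two loops. The following is a minimal sketch, not an official utility: `flatten_turns` is a hypothetical helper, and `sample_dialogue` is a small in-memory stand-in whose first turn mirrors the sample dialogue from this notebook. The second turn's utterance is a placeholder, and the assumption that a system frame may lack a `state` entry is handled with `.get`.

```python
# Hypothetical mini-dialogue. The first turn mirrors the notebook's sample;
# the second turn's utterance is a placeholder, not real data.
sample_dialogue = {
    "dialogue_id": "13_00000",
    "services": ["Movies_1"],
    "turns": [
        {
            "speaker": "USER",
            "utterance": "A movie will help me overcome the monotonous of the "
                         "day. Is there any good movie?",
            "frames": [{
                "service": "Movies_1",
                "slots": [],
                "actions": [{"act": "INFORM_INTENT", "slot": "intent",
                             "values": ["FindMovies"],
                             "canonical_values": ["FindMovies"]}],
                "state": {"active_intent": "FindMovies",
                          "requested_slots": [], "slot_values": {}},
            }],
        },
        {
            "speaker": "SYSTEM",
            "utterance": "<system utterance placeholder>",
            "frames": [{
                "service": "Movies_1",
                "slots": [],
                "actions": [{"act": "REQUEST", "slot": "location",
                             "values": [], "canonical_values": []}],
            }],
        },
    ],
}


def flatten_turns(dialogue):
    """Return one row per (turn, frame) with speaker, service, acts and intent."""
    rows = []
    for turn in dialogue["turns"]:
        for frame in turn["frames"]:
            # Assumption: 'state' may be absent on some frames, so fall back to {}.
            state = frame.get("state") or {}
            rows.append({
                "speaker": turn["speaker"],
                "service": frame["service"],
                "acts": [(a["act"], a["slot"], a["values"]) for a in frame["actions"]],
                "active_intent": state.get("active_intent"),
            })
    return rows


rows = flatten_turns(sample_dialogue)
for row in rows:
    print(row)
```

A flat list of rows like this is a convenient starting point for tasks such as intent prediction or dialogue state tracking, where each frame is treated as one training example.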


In [39]:
pprint(sample)

{'dialogue_id': '13_00000',
 'services': ['Movies_1'],
 'turns': [{'frames': [{'actions': [{'act': 'INFORM_INTENT',
                                     'canonical_values': ['FindMovies'],
                                     'slot': 'intent',
                                     'values': ['FindMovies']}],
                        'service': 'Movies_1',
                        'slots': [],
                        'state': {'active_intent': 'FindMovies',
                                  'requested_slots': [],
                                  'slot_values': {}}}],
            'speaker': 'USER',
            'utterance': 'A movie will help me overcome the monotonous of the '
                         'day. Is there any good movie?'},
           {'frames': [{'actions': [{'act': 'REQUEST',
                                     'canonical_values': [],
                                     'slot': 'location',
                                     'values': []}],
                        'service'