Datasheet

This datasheet is inspired by the Datasheets for Datasets paper.

Motivation

Q1) For what purpose was the dataset created ? Was there a specific task in mind ? Was there a specific gap that needed to be filled ?

Ans. This is an evaluation dataset for the task of capturing phone number sequences. There are two tracks: one in which the phone number is spoken in a single utterance, and one in which it is chunked across successive utterances. We know of no other datasets that attempt to study this problem, and this dataset is intended to fill that gap.

Q2) Who created the dataset and on behalf of which entity ?

Ans. The (internal) Operations team at Skit was involved in the generation of the dataset. Manas and Sachin were involved in the curation and collection of utterances, and Anirudh helped with the dataset release. These contributors worked on this dataset as part of the ML team at Skit.

Q3) Who funded the creation of the dataset ?

Ans. Skit funded the creation of this dataset.

Composition

Q4) What do the instances that comprise the dataset consist of ?

Ans. Each instance of the dataset consists of the following fields: call_id, turn_id, relative_path, tag and speaker_id. In both track 1 and track 2, relative_path points to the audio file in which a speaker speaks a phone number, speaker_id identifies that speaker, turn_id identifies the turn, and tag is the phone number.
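
As an illustration of the instance layout, the five fields can be modelled as a small record type. This is only a sketch: it assumes the instances are available as a CSV file with a header row naming the fields, which this datasheet does not state, and the `Instance` class, the `read_instances` helper and the sample row are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One dataset instance, using the field names listed in this datasheet."""
    call_id: str
    turn_id: str
    relative_path: str  # path to the audio file
    tag: str            # the phone number label
    speaker_id: str


def read_instances(handle):
    """Read instances from a CSV file object (assumed layout, with a header row)."""
    reader = csv.DictReader(handle)
    return [
        Instance(
            call_id=row["call_id"],
            turn_id=row["turn_id"],
            relative_path=row["relative_path"],
            tag=row["tag"],
            speaker_id=row["speaker_id"],
        )
        for row in reader
    ]


# Made-up row for illustration only; it is not taken from the dataset.
sample = io.StringIO(
    "call_id,turn_id,relative_path,tag,speaker_id\n"
    "c1,0,audio/c1_0.wav,9876543210,s1\n"
)
instances = read_instances(sample)
```

All fields are kept as strings so that phone numbers with leading zeros, if any, are not altered.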

Q5) How many instances are there in total (of each type, if appropriate) ?

Ans. There are 146 calls in the single-turn track and 142 calls in the two-turn track. In each track, each phone number is spoken by two (Indian English) speakers.

Q6) Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set ?

Ans. The dataset contains a sample of Indian phone numbers.

Q7) Are there recommended data splits (e.g., training, development/validation, testing) ?

Ans. No, there are no recommended data splits. This dataset is meant for evaluation purposes only.

Q8) Are there any errors, sources of noise, or redundancies in the dataset?

Ans. There could be background or channel noise present in the dataset, because the data was generated through telephone calls.

Q9) Other comments.

Ans. Speakers read the phone numbers from a list, so the recordings may be less spontaneous than real-world instances.

Collection Process

Q10) How was the data associated with each instance acquired ?

Ans. Members of the (internal) Operations team generated the data by calling an internal voicebot set up for data capture and reading out phone numbers from a list.

Q11) Who was involved in the data collection process and how were they compensated ?

Ans. The data was generated by the (internal) Operations team and they are full-time employees.

Q12) Over what timeframe was the data collected ?

Ans. The data was collected over a period of one week.

Q13) Was any preprocessing/cleaning/labelling of the data done ?

Ans. After generation, each data instance was labelled with the correct phone number.

Recommended Uses

Q14) Has the dataset been used for any tasks already ?

Ans. It has been used to benchmark various ASR systems for the task of phone number entity capture.
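
One plausible way to score such a benchmark is an exact-match rate between the labelled phone number and the digits recovered from an ASR hypothesis. The sketch below is an assumption about how this could be done, not a description of the metric actually used; `digits_only` and `exact_match_rate` are hypothetical helpers, and converting spelled-out digits ("nine eight ...") to numerals is not handled here.

```python
import re


def digits_only(text):
    """Drop every non-digit character, e.g. spaces or hyphens."""
    return re.sub(r"\D", "", text)


def exact_match_rate(references, hypotheses):
    """Fraction of hypotheses whose digit string equals the reference exactly."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        return 0.0
    hits = sum(
        digits_only(ref) == digits_only(hyp)
        for ref, hyp in zip(references, hypotheses)
    )
    return hits / len(references)


# Made-up values for illustration only.
refs = ["9876543210", "9123456780"]
hyps = ["98765 43210", "9123456789"]  # second hypothesis has a wrong last digit
rate = exact_match_rate(refs, hyps)
```

Exact match is a strict choice: a single wrong digit makes the captured number unusable, so partial credit is deliberately not given in this sketch.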

Q15) What (other) tasks could the dataset be used for ?

Ans. The two-turn track can be used to evaluate how system performance is affected when a phone number is chunked across turns.
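
To study chunking, turn-level hypotheses first have to be stitched back into one number per call. A minimal sketch follows, assuming each hypothesis arrives as a `(call_id, turn_id, text)` tuple and that `turn_id` sorts in speaking order; neither assumption is confirmed by this datasheet, and `join_turns` is a hypothetical helper.

```python
from collections import defaultdict


def join_turns(turn_hypotheses):
    """Concatenate the digits of each call's turns, ordered by turn_id."""
    by_call = defaultdict(list)
    for call_id, turn_id, text in turn_hypotheses:
        by_call[call_id].append((turn_id, text))

    joined = {}
    for call_id, turns in by_call.items():
        turns.sort(key=lambda turn: turn[0])  # assumed: turn_id gives the order
        joined[call_id] = "".join(
            ch for _, text in turns for ch in text if ch.isdigit()
        )
    return joined


# Made-up hypotheses for illustration only; turns may arrive out of order.
turns = [
    ("c1", 1, "43210"),
    ("c1", 0, "98765"),
    ("c2", 0, "91234 56780"),
]
joined = join_turns(turns)
```

The joined string for each call can then be compared with that call's labelled phone number, and the result set against the single-turn track to see the effect of chunking.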

Distribution and Maintenance

Q16) Will the dataset be distributed under a copyright or other intellectual property (IP) license ?

Ans. This dataset is distributed under a CC BY-NC license.

Q17) Who will be maintaining the dataset ?

Ans. The research team at Skit will be maintaining the dataset. They can be contacted by sending an email to ml-research@skit.ai.

Q18) Will the dataset be updated in the future (e.g., to correct labelling errors, add new instances, delete instances) ?

Ans. In case there are errors, we will try to collate them and share an updated version every 3 months. We also plan to add more instances and variations to the dataset to make it more robust.