wspr-ncsu/robocall-audio-dataset

Robocall Audio Dataset

Paper | Google Colab Example | Website | Citation BibTex

Dataset Summary

Robocall Audio Dataset is a collection of over one thousand audio recordings of automated or semi-automated phone calls, commonly called robocalls. These recordings were made available by the FTC through the Project Point of No Entry (PPoNE) initiative (FTC link, FTC News, Web Archive link). All of the recordings were used in real-world calls. Most are suspected to be illegal, and a majority were used by malicious actors to defraud people. The dataset also includes the cease and desist letters sent by the FTC to the suspected call-originating entity (telephone carrier or the robocaller).

Data Collection

Each audio recording was collected using the links embedded within the Cease and Desist letters sent by the FTC to the suspected call-originating entity (telephone carrier or the robocaller). The webpage and the PDF files published on the PPoNE website were collected using automated crawlers. Links embedded within the PDF were extracted using pdfgrep and then downloaded using wget.
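The extraction and download step described above can be sketched as a small Python wrapper around the two tools. This is an illustration only, not the authors' script: the regular expression, the helper names, and the output directory are assumptions, and it presumes the links appear as plain text that pdfgrep can match.

```python
import subprocess

# POSIX extended regex for http(s) URLs (an assumed pattern, not taken from the README).
URL_PATTERN = r"https?://[^[:space:]]+"


def pdfgrep_command(pdf_path):
    """Build a pdfgrep call that prints only the matching URL text (-o)."""
    return ["pdfgrep", "-o", URL_PATTERN, str(pdf_path)]


def wget_command(url, out_dir):
    """Build a wget call that saves the download under out_dir (-P)."""
    return ["wget", "-P", str(out_dir), url]


def unique_urls(pdfgrep_output):
    """Deduplicate URLs, keep first-seen order, drop trailing punctuation."""
    seen = []
    for line in pdfgrep_output.splitlines():
        url = line.strip().rstrip(".,;)")
        if url and url not in seen:
            seen.append(url)
    return seen


def download_links(pdf_path, out_dir):
    """Extract links from one PDF and download each of them."""
    # pdfgrep exits non-zero when nothing matches, so do not raise on that.
    result = subprocess.run(
        pdfgrep_command(pdf_path), capture_output=True, text=True, check=False
    )
    for url in unique_urls(result.stdout):
        subprocess.run(wget_command(url, out_dir), check=True)
```

Calling `download_links("letter.pdf", "audio-raw")` (both names hypothetical) would require pdfgrep and wget to be installed; the command builders and `unique_urls` can be checked without either tool.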

Audio Recording Setup

Although this dataset does not contain granular information about where or how these audio examples were collected, most example robocall audio recordings are collected using telephony honeypots, voicemails, or reports from phone users who may have recorded the call using their own devices. These calls were likely generated by a robocalling system, and the audio traversed the phone network (over a logical channel) before being recorded by the recipient.

Curating and Normalizing the Dataset

Since these recordings are sourced from various honeypots and voicemails, the original audio format included wav, amr, and mp3. Some recordings were in stereo and others in mono.

The recordings were converted to WAV (pcm_s16le) and resampled to 16kHz using ffmpeg. When the source audio was in stereo, it was converted into two mono streams (filenames _left.wav and _right.wav). The _left.wav contains the audio stream originated by the remote party (robocaller), and the _right.wav contains the audio stream originated by the local party (honeypot or voicemail). Only the _left.wav files were transcribed and included in the dataset. However, the respective _right.wav audio files are also included in the dataset for completeness.
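The normalization described above could be reproduced with ffmpeg commands along the following lines. The exact commands the authors ran are not given, so treat this as a sketch: the function names and file stems are hypothetical, and the stereo split uses ffmpeg's channelsplit filter as one plausible way to obtain the two mono streams.

```python
import subprocess

# Output options shared by every file: 16 kHz sample rate, 16-bit little-endian PCM.
OUTPUT_OPTIONS = ["-ar", "16000", "-c:a", "pcm_s16le"]


def mono_command(src, dst):
    """Convert a mono recording (wav, amr, mp3, ...) to 16 kHz PCM WAV."""
    return ["ffmpeg", "-i", src, *OUTPUT_OPTIONS, dst]


def stereo_split_command(src, stem):
    """Split a stereo recording into <stem>_left.wav and <stem>_right.wav."""
    return [
        "ffmpeg", "-i", src,
        # channelsplit emits the left channel first, then the right channel.
        "-filter_complex", "channelsplit=channel_layout=stereo[left][right]",
        "-map", "[left]", *OUTPUT_OPTIONS, f"{stem}_left.wav",
        "-map", "[right]", *OUTPUT_OPTIONS, f"{stem}_right.wav",
    ]


def run(command):
    """Execute a prepared ffmpeg command (requires ffmpeg on the PATH)."""
    subprocess.run(command, check=True)
```

For example, `run(stereo_split_command("call.mp3", "call"))` would write `call_left.wav` and `call_right.wav`, where `call.mp3` is a placeholder input.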

Dataset Format

The metadata.csv file contains the filename and the transcription of each audio recording. It also includes the language used within the call, which was detected automatically using Whisper. The dataset consists of 1432 calls, of which 96.2% (1378) are in English and 3.8% (54) are in Mandarin/Chinese. Whisper's medium (multilingual) model was used to transcribe the audio. The specific cease and desist letter or warning letter is also included for each audio recording.
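The language breakdown quoted above can be recomputed from the `language` column. The following is a minimal, dependency-free sketch; the helper names are hypothetical, and it assumes metadata.csv sits in the working directory.

```python
import csv
from collections import Counter


def load_rows(csv_path="metadata.csv"):
    """Read metadata.csv into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def language_share(rows):
    """Map each language code to (number of calls, percentage of all calls)."""
    counts = Counter(row["language"] for row in rows)
    total = sum(counts.values())
    return {
        lang: (n, round(100 * n / total, 1))
        for lang, n in counts.most_common()
    }
```

Running `language_share(load_rows())` on the full file should give shares consistent with the figures above: 1378 of 1432 calls is 96.2% and 54 of 1432 is 3.8%.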

Cease and Desist Letters and Warning Letters

The cease and desist letters and warning letters are included in PDF format in the pdf_files directory. The case_pdf column in the metadata.csv file contains the link to the specific letter for each audio recording.
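Because every recording points to its letter through `case_pdf`, the recordings can be grouped by the letter that referenced them. The sketch below assumes rows shaped like those in metadata.csv (for instance, as produced by `csv.DictReader`); the function name and the sample file names other than the first are made up for illustration.

```python
from collections import defaultdict


def recordings_by_letter(rows):
    """Group audio file names under the letter PDF that references them."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["case_pdf"]].append(row["file_name"])
    return dict(grouped)


# Illustrative rows; only the first file name appears in the README.
sample_rows = [
    {"file_name": "audio-wav-16khz/1112259_normalized.wav",
     "case_pdf": "pdf_files/letter_a.pdf"},
    {"file_name": "audio-wav-16khz/example_2.wav",
     "case_pdf": "pdf_files/letter_a.pdf"},
    {"file_name": "audio-wav-16khz/example_3.wav",
     "case_pdf": "pdf_files/letter_b.pdf"},
]

groups = recordings_by_letter(sample_rows)
for letter, files in groups.items():
    print(letter, len(files))
```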

How to use the dataset

The dataset is hosted on GitHub and can be loaded with pandas or the Hugging Face datasets library.

import pandas as pd

df = pd.read_csv('metadata.csv')

df.columns
#Output: Index(['file_name', 'language', 'transcript', 'case_details', 'case_pdf'], dtype='object')

df.head()

The dataset can also be loaded using Hugging Face's datasets library.

from datasets import Dataset, Audio
import pandas as pd

df = pd.read_csv('metadata.csv')

audio_dataset = Dataset.from_dict({
    "audio": df['file_name'].to_list(),
    "transcript": df['transcript'].to_list(),
    "language" : df['language'].to_list(),
    "case_pdf" : df['case_pdf'].to_list(),
}).cast_column("audio", Audio(sampling_rate=16000))


#audio_dataset

# Output
# >> Dataset({
#     features: ['audio', 'transcript', 'language', 'case_pdf'],
#     num_rows: 1432
# })

Inspect individual audio entries

audio_dataset[0]

'''
#Output
{'audio': {'path': 'audio-wav-16khz/1112259_normalized.wav',
  'array': array([0.03210449, 0.03390503, 0.03796387, ..., 0.00616455, 0.00695801,
         0.0072937 ]),
  'sampling_rate': 16000},
 'transcript': 'We would like to inform you that there is an order placed for Apple iPhone 11 Pro using your Amazon account. If you do not authorize this order, press 1 or press 2 to authorize this order. ',
 'language': 'en',
 'case_pdf': 'pdf_files/pointofnoentry-every1telecomceasedesistletterfinaljms.pdf'}

'''
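Since each decoded entry carries both the samples and the sampling rate, the length of a recording follows directly from the two. A small sketch, assuming the dictionary layout shown in the output above (other versions of the datasets library may return a different object for the audio field); `duration_seconds` is a hypothetical helper.

```python
def duration_seconds(audio):
    """Return the recording length in seconds: sample count / sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


# Stand-in for audio_dataset[0]["audio"]: two seconds of silence at 16 kHz.
example_audio = {"array": [0.0] * 32000, "sampling_rate": 16000}
print(duration_seconds(example_audio))
```

With a loaded dataset, the same helper would be called as `duration_seconds(audio_dataset[0]["audio"])`.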

License

This document describing the data is released under the Creative Commons BY-ND [1] license. The data itself is in the public domain. If you find this structured data useful, we would appreciate (but do not require) an acknowledgement in any publications.

Citation

@techreport{robocallDatasetTechReport,
  author      = {{Sathvik Prasad and Bradley Reaves}},
  title       = {{Robocall Audio from the FTC's Project Point of No Entry}},
  institution = {{North Carolina State University}},
  year        = {2023},
  month       = {Nov},
  number      = {TR-2023-1},
  url         = {}
}

ToDos

Pull Requests are welcome!

  • Extract the caller ID information from the PDF complaints for each call
  • Extract the time of the call from the PDF complaints for each call