Create a benchmark dataset of Audio Deepfakes #365

Open
dennyabrain opened this issue Jul 16, 2024 · 1 comment
dennyabrain commented Jul 16, 2024

Goal

To create a benchmark dataset of audio files to assist in the evaluation of deepfake detection tools.

Overview

During the first quarter since the launch of the DAU, one trend that has emerged is the presence of various manipulation techniques in audio content. This also includes video files whose audio track has been manipulated. Being able to reliably identify manipulated portions of an audio file is therefore essential. The manipulation techniques noted so far are:

  1. Splicing synthetically generated media into a natural audio recording
  2. Overdubbing a video with mimicry (performed by a human, hence no synthetic media)
  3. Using tools like ElevenLabs to generate synthetic media in a celebrity's voice from text

While work is underway on techniques that can detect the various types of manipulation used in audio files received by the DAU, we want to create a standard benchmark dataset of audio files. The goal is for this dataset to be a useful tool for evaluating the performance of the various proprietary and open-source tools we might use in the project.

Working Definitions

To avoid confusion, we will use the following definitions while working on this issue:

  1. Natural Audio: a recording of a person made using a microphone and saved to a digital file
  2. Synthetic Audio: audio generated from scratch using techniques like generative AI and consumer apps like Midjourney, Canva, etc.
  3. Audio Effects: the application of any DSP technique, such as stretching or slowing down, to a natural audio file
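To make the third definition concrete, here is a toy illustration (a minimal sketch, not a real DSP routine) of what an "effect" applied to natural audio means, assuming the audio has already been decoded to a plain list of PCM samples:

```python
def slow_down(samples, factor=2):
    """Naive time-stretch: repeat each PCM sample `factor` times.

    This also lowers the pitch; real tools use resampling or a phase
    vocoder. It is only meant to illustrate that an "audio effect"
    transforms natural audio without generating any new content.
    """
    out = []
    for s in samples:
        out.extend([s] * factor)
    return out
```

Crucially, the output is still derived from a natural recording, which is why effects should be tracked separately from synthetic audio in the dataset.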

Scope of the task

  1. List about 10-15 public figures, split by language, accent, and gender.
  2. Get their audio recordings from publicly available repositories like YouTube.
  3. Strip the audio and generate different versions of it (e.g. single sentence, long speech, monologue), where applicable.
  4. Automatically generate transcripts of their speech.
  5. Convert the transcripts back into synthetic audio using open and proprietary models. The dataset will include a column recording how the synthetic media was generated.
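Steps 2 and 3 above could be scripted with common command-line tools. As a minimal sketch, assuming yt-dlp and ffmpeg are available (an assumption, not a project decision), the helpers below only build the commands, so the exact flags used for every clip stay visible and loggable:

```python
def download_cmd(url, out_path):
    # Hypothetical step 2: extract the best audio track from a public
    # video URL with yt-dlp and save it as WAV.
    return ["yt-dlp", "-x", "--audio-format", "wav", "-o", out_path, url]


def segment_cmd(src, start_sec, duration_sec, dst):
    # Hypothetical step 3: cut one clip (e.g. a single sentence) out of
    # a longer recording with ffmpeg, copying the stream without re-encoding.
    return ["ffmpeg", "-i", src, "-ss", str(start_sec),
            "-t", str(duration_sec), "-c", "copy", dst]
```

Each returned list can be passed to `subprocess.run`; keeping command construction separate from execution also makes it easy to record exactly how every clip was produced.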

Deliverable

An open dataset with the following columns:

  1. Name of the celebrity
  2. Language spoken in the audio
  3. Gender
  4. Quality of the audio
  5. Natural or synthetic
  6. If synthetic, the tool used
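As one possible encoding of these columns (the field names and the quality scale are illustrative, not decided), a manifest row could be validated so that the "if synthetic, tool used" rule is enforced mechanically:

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AudioRecord:
    """One row of the proposed benchmark manifest (names are illustrative)."""
    celebrity: str
    language: str
    gender: str
    quality: str                 # e.g. "high" / "low"; scale still to be decided
    synthetic: bool              # False means a natural recording
    tool: Optional[str] = None   # required only when synthetic is True

    def __post_init__(self):
        if self.synthetic and not self.tool:
            raise ValueError("synthetic rows must record the generating tool")
        if not self.synthetic and self.tool:
            raise ValueError("natural rows should leave 'tool' empty")
```

`asdict(row)` then gives a plain dict that can be written straight to CSV or JSON lines.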

Approach

Let's plan to work on this collaboratively. We can discuss:

  1. which celebrity's data we are working on;
  2. which transcription tool we are using;
  3. which tool we are using to generate synthetic audio.

Having a mix of techniques and transcription tools shouldn't hurt, but it would be good to keep sharing progress here so we aren't re-solving problems that already have a working solution.

dennyabrain commented:
@swairshah has begun preliminary exploration here - https://github.com/swairshah/audio-research
