<a href="https://colab.research.google.com/github/rmcpantoja/My-Colab-Notebooks/blob/main/notebooks/making_speech_dataset_with_Auditok_%7Dipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Colab notebook to chunk audio with [auditok](https://github.com/amsehili/auditok).
## Useful for making speech datasets automatically

Notebook made by [rmcpantoja](https://github.com/rmcpantoja/)

## First steps

In [None]:
#@title Mount your Google Drive
#@markdown This is useful if you want to save your generated dataset to your [drive](http://drive.google.com/).
from google.colab import drive
# Mount at /content/drive so the default export path used later in the notebook resolves.
drive.mount('/content/drive', force_remount=True)

In [None]:
#@title Install software
#@markdown This cell installs auditok from its repository and installs or upgrades the pip packages the notebook needs.

#@markdown * Note: please restart the runtime environment when this cell is done executing.

!git clone https://github.com/amsehili/auditok.git
%cd auditok
!python setup.py install
%cd ..
!pip install --upgrade gdown
!pip install yt-dlp pydub
import yt_dlp
from google.colab import files
def download_from_youtube(yt_url):
  ydl_opts = {
      'format': 'bestaudio/best',
      'postprocessors': [{
          'key': 'FFmpegExtractAudio',
          'preferredcodec': 'wav',
          'preferredquality': '320',
      }],
  }

  with yt_dlp.YoutubeDL(ydl_opts) as ydl:
      ydl.download([yt_url])

!mkdir -p project

## Starting to interact with the tool

In [None]:
#@title Get audio data of the voice
#@markdown You can skip this cell if you are going to upload an audio file through the left panel in Colab.

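#@markdown **Choose how you want to get the audio:**
mode = "download from Youtube (yt-dlp)" #@param ["download from Youtube (yt-dlp)", "download via a Drive id", "Upload"]
# Named audio_source rather than input so the Python builtin is not shadowed.
audio_source = "Link or ID to download" #@param {type:"string"}
if mode == "download from Youtube (yt-dlp)" and "http" in audio_source:
  download_from_youtube(audio_source)
  print("Done!")
elif mode == "download via a Drive id" and audio_source.startswith("1"):
  !gdown {audio_source}
  print("Done!")
elif mode == "Upload":
  uploaded = files.upload()
  print("Done!")
else:
  print("Nothing was fetched: check that the link or ID matches the selected mode.")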
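
The splitting cell further down expects the path of the audio file, and a downloaded or uploaded file will usually not be named `audio.wav`. As a minimal sketch, assuming the file ends up in the current working directory, the hypothetical helper `newest_wav` (not part of the original notebook) returns the most recently modified `.wav` file so its path can be pasted into that cell:

```python
import glob
import os

def newest_wav(folder="."):
    """Return the most recently modified .wav file in `folder`, or None if there is none."""
    candidates = glob.glob(os.path.join(folder, "*.wav"))
    if not candidates:
        return None
    # The file with the latest modification time is assumed to be the one just fetched.
    return max(candidates, key=os.path.getmtime)

# Hypothetical usage: look in the notebook's working directory.
print(newest_wav("."))
```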

In [None]:
#@title Settings
#@markdown Adjust the tool's settings here, such as the clip duration limits and silence detection.

#@markdown ---

#@markdown **Minimum duration of a clip (seconds)**
min_duration = 10.0 #@param {type:"number"}
#@markdown ---
#@markdown **Maximum duration of a clip (seconds)**
max_duration = 15.5 #@param {type:"number"}
#@markdown ---
#@markdown **Maximum duration of continuous silence tolerated within a clip (seconds)**
max_silence = 0.5 #@param {type:"number"}
#@markdown ---

#@markdown **Energy threshold of detection**
threshold = 55 #@param {type:"integer"}
#@markdown ---
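
The duration settings above depend on each other, so a quick consistency check before splitting can save a failed run. The following is a rough sketch only: `check_split_settings` is a hypothetical helper that is not part of auditok, and the rules it applies are assumptions about what makes a sensible combination.

```python
def check_split_settings(min_dur, max_dur, max_silence):
    """Return a list of problems found in the settings; an empty list means they look consistent."""
    problems = []
    if min_dur <= 0:
        problems.append("min_duration must be positive")
    if max_dur < min_dur:
        problems.append("max_duration must be at least min_duration")
    if max_silence < 0:
        problems.append("max_silence must not be negative")
    if max_silence >= max_dur:
        problems.append("max_silence should be shorter than max_duration")
    return problems

# Example with the default values from the Settings cell.
print(check_split_settings(10.0, 15.5, 0.5))
```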

In [None]:
#@title Let's begin the work!
#@markdown But first, here are some options that set the way the notebook will save the dataset.

#@markdown **Location of the audio file containing the voice**
audio_dir = "audio.wav" #@param {type:"string"}
#@markdown ---

#@markdown **Compress as a zip file?**
compress = True #@param {type:"boolean"}
#@markdown ---

#@markdown **If it is compressed as zip, where to save it?**
export_dir = "/content/drive/MyDrive/voice1" #@param {type:"string"}
#@markdown ---
import auditok
print("Splitting audio... this may take a short time.\n")
# split returns a generator of AudioRegion objects
audio_regions = auditok.split(
    audio_dir,
    min_dur=min_duration,     # minimum duration of a valid audio event in seconds
    max_dur=max_duration,       # maximum duration of an event
    max_silence=max_silence, # maximum duration of tolerated continuous silence within an event
    energy_threshold=threshold # threshold of detection
)

for i, r in enumerate(audio_regions):

    # Regions returned by `split` have 'start' and 'end' metadata fields
    #print("Region {i}: {r.meta.start:.3f}s -- {r.meta.end:.3f}s".format(i=i, r=r))

    # play detection
    # r.play(progress_bar=True)

    # region's metadata can also be used with the `save` method
    # (no need to explicitly specify region's object and `format` arguments)
    filename = r.save("project/region_{meta.start:.3f}-{meta.end:.3f}.wav")
    #print("region saved as: {}".format(filename))
import os
import shutil

n_wavs = len(os.listdir("/content/project"))
print(f"Done. The audio has been split into {n_wavs} files.")
if compress:
  print("Compressing wavs...\n")
  # Create the export folder if it does not exist yet, then zip the whole project folder.
  os.makedirs(export_dir, exist_ok=True)
  shutil.make_archive(os.path.join(export_dir, "wavs"), 'zip', '/content/project')
  print("Compression finished!")
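
After splitting, it can help to know how much audio the dataset actually holds. This is a minimal sketch using only the standard library `wave` module; `summarize_wavs` is a hypothetical helper, and it assumes the clips are uncompressed PCM WAV files in the `/content/project` folder used above.

```python
import os
import wave

def summarize_wavs(folder):
    """Return (number_of_clips, total_seconds) for the .wav files in `folder`."""
    count, total = 0, 0.0
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".wav"):
            continue
        with wave.open(os.path.join(folder, name), "rb") as wav:
            # Duration in seconds = number of frames / frames per second.
            total += wav.getnframes() / wav.getframerate()
        count += 1
    return count, total

# Only runs when the project folder exists, as it does after the splitting cell.
if os.path.isdir("/content/project"):
    clips, seconds = summarize_wavs("/content/project")
    print(f"{clips} clips, {seconds / 60:.1f} minutes in total")
```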

# Related notebooks

Would you like to transcribe these audio files? Use one of the Whisper transcription notebooks:

* [English notebook](https://colab.research.google.com/github/rmcpantoja/My-Colab-Notebooks/blob/main/notebooks/OpenAI%20Whisper%20-%20DotCSV%20(Speech%20dataset%20multi-transcryption%20support)en.ipynb)
* [French notebook](https://colab.research.google.com/github/rmcpantoja/My-Colab-Notebooks/blob/main/notebooks/OpenAI%20Whisper%20-%20DotCSV%20(Speech%20dataset%20multi-transcryption%20support)fr.ipynb)
* [Spanish notebook](https://colab.research.google.com/github/rmcpantoja/My-Colab-Notebooks/blob/main/notebooks/OpenAI%20Whisper%20-%20DotCSV%20(Speech%20dataset%20multi-transcryption%20support)es.ipynb)