<a href="https://colab.research.google.com/github/ua-datalab/NLP-Speech/blob/main/AI_applications_for_Audio/AI_applications_for_Audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI applications for Audio

![](https://content.cleanvoice.ai/uploads/large_Copy_of_Add_a_heading_6_1_39ab1d9f83.png)

Source: [content.cleanvoice.ai](https://content.cleanvoice.ai/uploads/large_Copy_of_Add_a_heading_6_1_39ab1d9f83.png)

## Housekeeping

* Check that the recording is on
* Check audio and screenshare
* Share link to notebook in chat
* Light mode and readable font size
* GPU runtime and `run-all`

Description of audio AI systems: use cases and platforms.

# Basic Audio Tools

## Converting between audio formats with **ffmpeg**

In [None]:
# Install ffmpeg, a command-line tool for processing audio files (-y skips the confirmation prompt):
!apt-get install -y ffmpeg

In [None]:
# Get some audio files
!wget -O mary.mp3 https://raw.githubusercontent.com/petewarden/openai-whisper-webapp/main/mary.mp3
!wget -O daisy_HAL_9000.mp3 https://raw.githubusercontent.com/petewarden/openai-whisper-webapp/main/daisy_HAL_9000.mp3
!wget -O AllStar.mp3 https://raw.githubusercontent.com/keatonkraiger/Whisper-Transcribe-and-Translate-Tutorial/main/AllStar.mp3
!wget -O Cupid_Fifty_Fifty_Korean_Version.mp3 https://raw.githubusercontent.com/keatonkraiger/Whisper-Transcribe-and-Translate-Tutorial/main/Cupid_Fifty_Fifty_Korean_Version.mp3


In [None]:
# Convert format:
!ffmpeg -i mary.mp3 mary.wav

In [None]:
from IPython.display import Audio
Audio("/content/mary.wav")
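
To check what a conversion produced, Python's standard-library `wave` module can read the header of a PCM WAV file. The sketch below is self-contained: it writes a short test tone to a hypothetical `tone.wav` (not one of the downloaded files) and then reports its channels, bit depth, sample rate, and duration. The helper names `write_test_tone` and `describe_wav` are made up for this illustration.

```python
import math
import struct
import wave


def write_test_tone(path, freq_hz=440.0, seconds=1.0, rate=16000):
    """Write a mono 16-bit sine tone so the example has a file to inspect."""
    n_frames = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / rate)))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(frames)


def describe_wav(path):
    """Return channels, bit depth, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "sample_rate": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


write_test_tone("tone.wav")
info = describe_wav("tone.wav")
print(info)
```

Calling `describe_wav("/content/mary.wav")` should work the same way, assuming ffmpeg wrote uncompressed PCM, which is its usual default for `.wav` output.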

In [None]:
# change the volume of an audio file
!ffmpeg -i AllStar.mp3 -af 'volume=0.5' AllStar_edited.mp3

In [None]:
from IPython.display import Audio
Audio("/content/AllStar_edited.mp3")
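
The `volume=0.5` filter multiplies every sample by 0.5. Audio levels are usually quoted in decibels, so a small sketch of the conversion between a linear gain factor and dB may help when choosing a value. The function names are hypothetical helpers, not part of ffmpeg.

```python
import math


def gain_to_db(gain):
    """Convert a linear amplitude factor (e.g. 0.5) to decibels."""
    return 20 * math.log10(gain)


def db_to_gain(db):
    """Convert a level change in decibels back to a linear amplitude factor."""
    return 10 ** (db / 20)


print(round(gain_to_db(0.5), 2))   # -6.02
print(round(db_to_gain(-12), 3))
```

Halving the amplitude is therefore a drop of about 6 dB. The ffmpeg documentation describes the `volume` filter as also accepting dB values directly (for example `volume=-6dB`), which is worth confirming against your installed version.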

In [None]:
# Compress audio to a 128 kbit/s bitrate (-y overwrites the existing output file without prompting)
!ffmpeg -y -i AllStar.mp3 -b:a 128k AllStar_edited.mp3

In [None]:
from IPython.display import Audio
Audio("/content/AllStar_edited.mp3")
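
ffmpeg's audio bitrate option is expressed in bits per second, so 128 kbit/s is written as `128k`. For a constant-bitrate file, the size follows directly from bitrate and duration. The sketch below estimates it while ignoring container and metadata overhead; the 200-second duration is a made-up example, not the length of `AllStar.mp3`.

```python
def estimated_size_bytes(bitrate_kbps, duration_s):
    """Approximate size of a constant-bitrate audio stream in bytes."""
    bits = bitrate_kbps * 1000 * duration_s
    return bits / 8


# Hypothetical 200-second track at two bitrates
for kbps in (128, 320):
    size = estimated_size_bytes(kbps, 200)
    print(f"{kbps} kbit/s -> {size / 1e6:.1f} MB")
```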

In [None]:
# trim audio
!ffmpeg -i Cupid_Fifty_Fifty_Korean_Version.mp3 \
 -ss 00:01:54 -to 00:06:53 \
 -c copy Cupid_Fifty_Fifty_Korean_Version_edited.mp3
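
The `-ss` and `-to` options give the start and end positions of the excerpt as `HH:MM:SS` timestamps. A minimal sketch for turning such timestamps into seconds, which makes it easy to check how long the trimmed clip should be (`to_seconds` is a hypothetical helper):

```python
def to_seconds(timestamp):
    """Convert an HH:MM:SS timestamp to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


start = to_seconds("00:01:54")
end = to_seconds("00:06:53")
print(start, end, end - start)   # 114 413 299
```

If the end position lies beyond the end of the input, ffmpeg simply stops at the end of the file, so the actual clip can be shorter than this difference.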

In [None]:
from IPython.display import Audio
Audio("/content/Cupid_Fifty_Fifty_Korean_Version_edited.mp3")

In [None]:
# Split audio file into the first 30 seconds and the remainder
!ffmpeg -i Cupid_Fifty_Fifty_Korean_Version.mp3 \
 -t 00:00:30 -c copy Cupid_Fifty_Fifty_Korean_Version_part1.mp3 \
 -ss 00:00:30 -c copy Cupid_Fifty_Fifty_Korean_Version_part2.mp3


In [None]:
from IPython.display import Audio
Audio("/content/Cupid_Fifty_Fifty_Korean_Version_part1.mp3")
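
Splitting into more than two parts follows the same pattern: each part needs a start offset and a duration. The sketch below computes those pairs and builds one ffmpeg command per part without running anything. The file name `song.mp3`, the 75-second length, and both helper functions are assumptions for illustration.

```python
def split_points(total_s, chunk_s):
    """Return (start, duration) pairs covering total_s seconds in chunk_s-second pieces."""
    points = []
    start = 0
    while start < total_s:
        points.append((start, min(chunk_s, total_s - start)))
        start += chunk_s
    return points


def ffmpeg_split_commands(src, total_s, chunk_s):
    """Build one stream-copy ffmpeg command (as an argument list) per chunk."""
    base = src.rsplit(".", 1)[0]
    commands = []
    for index, (start, duration) in enumerate(split_points(total_s, chunk_s), start=1):
        out = f"{base}_part{index}.mp3"
        commands.append(
            ["ffmpeg", "-i", src, "-ss", str(start), "-t", str(duration), "-c", "copy", out]
        )
    return commands


for cmd in ffmpeg_split_commands("song.mp3", total_s=75, chunk_s=30):
    print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run` to perform the split; printing them first is a cheap way to review the cut points.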

## Generating audio chord progressions in Python

ToDo: explain MIDI format

In [None]:
!pip install mido

Collecting mido
  Downloading mido-1.3.3-py3-none-any.whl.metadata (6.4 kB)
Downloading mido-1.3.3-py3-none-any.whl (54 kB)
Installing collected packages: mido
Successfully installed mido-1.3.3


In [None]:
# Function to get chord notes in root position
def get_chord_notes(chord):
    chord_map = {
        "C": [0, 4, 7], "Cm": [0, 3, 7],
        "D": [2, 6, 9], "Dm": [2, 5, 9],
        "E": [4, 8, 11], "Em": [4, 7, 11],
        "F": [5, 9, 12], "Fm": [5, 8, 12],
        "G": [7, 11, 14], "Gm": [7, 10, 14],
        "A": [9, 13, 16], "Am": [9, 12, 16],
        "B": [11, 15, 18], "Bm": [11, 14, 18]
    }

    root = 48  # Starting at C3
    if chord in chord_map:
        return [root + interval for interval in chord_map[chord]]
    else:
        raise ValueError(f"Unknown chord: {chord}")
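
The numbers returned by `get_chord_notes` are MIDI note numbers, where each step is one semitone. To sanity-check a chord, a small sketch can translate note numbers into note names and frequencies. It assumes the common convention that note 60 is C4 (so 48 is C3, matching the comment above) and standard tuning with A4 at 440 Hz; `note_name` and `note_frequency` are hypothetical helpers.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_name(midi_note):
    """Name a MIDI note number, using the convention that 60 is C4."""
    octave = midi_note // 12 - 1
    return f"{NOTE_NAMES[midi_note % 12]}{octave}"


def note_frequency(midi_note):
    """Equal-tempered frequency in Hz, with A4 (note 69) tuned to 440 Hz."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)


# E minor as built by the chord map: root 48 plus intervals 4, 7, 11
em = [48 + interval for interval in (4, 7, 11)]
print([note_name(n) for n in em])   # ['E3', 'G3', 'B3']
print([round(note_frequency(n), 1) for n in em])
```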


In [None]:
# User-provided chord progression
# Modify this to change progression
user_chords = ["Em", "Em", "Em", "Em",
               "G", "G", "G", "G",
               "D", "D", "D", "D",
               "A", "A", "A", "A"]

# Choose how many times to repeat progression:
progression_repeat = 4

# Choose instrument (MIDI program numbers: 0-127)
# Some common instruments:
# 0  = Acoustic Grand Piano
# 24 = Nylon String Guitar
# 33 = Electric Bass (finger)
# 40 = Violin
# 56 = Trumpet
# 73 = Flute
instrument = 73  # Change this to select an instrument

#Choose file name:
midi_path = "basic_progression_flute.mid"


In [None]:
from mido import Message, MidiFile, MidiTrack

# Create a new MIDI file and track
midi = MidiFile()
track = MidiTrack()
midi.tracks.append(track)

# Select the instrument chosen above (the MIDI default tempo of 120 BPM is left unchanged)
track.append(Message('program_change', program=instrument, time=0))

# Add chords to the track (each lasting one beat, i.e. 0.5 seconds at 120 BPM)
tick_duration = 480  # Ticks per quarter note (mido's default ticks_per_beat)

# Repeat the progression:
for _ in range(progression_repeat):
    for chord in user_chords:
        notes = get_chord_notes(chord)
        for note in notes:
            track.append(Message('note_on', note=note, velocity=64, time=0))
        track.append(Message('note_off', note=notes[0], velocity=64, time=tick_duration))
        for note in notes[1:]:
            track.append(Message('note_off', note=note, velocity=64, time=0))

# Save the MIDI file
midi.save(midi_path)
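
MIDI durations are stored in ticks, so the length of the file in seconds depends on the ticks-per-beat resolution and the tempo. The sketch below does the conversion by hand, assuming mido's default of 480 ticks per beat and the MIDI default tempo of 500,000 microseconds per beat (120 BPM), since the code above sets neither explicitly. `ticks_to_seconds` is a hypothetical helper.

```python
def ticks_to_seconds(ticks, ticks_per_beat=480, tempo_us_per_beat=500_000):
    """Convert MIDI ticks to seconds for a fixed tempo (defaults are assumptions)."""
    return ticks * tempo_us_per_beat / (ticks_per_beat * 1_000_000)


chords_per_pass = 16   # length of user_chords in the example progression
repeats = 4            # progression_repeat
total_ticks = chords_per_pass * repeats * 480

print(ticks_to_seconds(480))          # 0.5
print(ticks_to_seconds(total_ticks))  # 32.0
```

With these defaults, each 480-tick chord lasts half a second and the full example runs for 32 seconds. Doubling `tick_duration` would give one-second chords.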


Listen to the MIDI file here: [https://midiplayer.ehubsoft.net/](https://midiplayer.ehubsoft.net/)

# Application 1: Music Generation



## Soundraw


Web-based music generator with a free demo, but needs an account for downloads:

[https://soundraw.io/generate_music](https://soundraw.io/generate_music)

## MusicGen by Meta

MusicGen generates music samples from text descriptions or audio prompts.

See full demo here:
https://colab.research.google.com/github/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/music-generation/music-generation.ipynb


In [None]:
%pip install -q "torch>=2.1" "gradio>=4.19" "transformers" packaging --extra-index-url https://download.pytorch.org/whl/cpu

In [None]:
from collections import namedtuple
from functools import partial
import gc
from pathlib import Path
from typing import Optional, Tuple
import warnings

from IPython.display import Audio
import numpy as np
import torch
from torch.jit import TracerWarning
from transformers import AutoProcessor, MusicgenForConditionalGeneration
from transformers.modeling_outputs import (
    BaseModelOutputWithPastAndCrossAttentions,
    CausalLMOutputWithCrossAttentions,
)

# Ignore tracing warnings
warnings.filterwarnings("ignore", category=TracerWarning)



In [None]:
import sys
from packaging.version import parse


if sys.version_info < (3, 8):
    import importlib_metadata
else:
    import importlib.metadata as importlib_metadata
loading_kwargs = {}

if parse(importlib_metadata.version("transformers")) >= parse("4.40.0"):
    loading_kwargs["attn_implementation"] = "eager"


# Load the pipeline
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small", torchscript=True, return_dict=False, **loading_kwargs)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# Edit the prompt here to test a different audio:
# text_prompt = ["80s pop track with bassy drums and synth"]
text_prompt = ["catchy show tune"]

# change the duration of the music here:
sample_length = 4  # seconds

In [None]:
# Compute the number of tokens to generate for the requested duration, and read the sampling rate:
n_tokens = sample_length * model.config.audio_encoder.frame_rate + 3
sampling_rate = model.config.audio_encoder.sampling_rate
print("Sampling rate is", sampling_rate, "Hz")

model.to("cpu")
model.eval();

Sampling rate is 32000 Hz
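
The number of tokens to generate grows linearly with the requested duration: the audio encoder emits a fixed number of frames per second, and the notebook adds 3 extra tokens on top. The sketch below reproduces that arithmetic without loading the model. The frame rate of 50 frames per second is an assumption about `facebook/musicgen-small`; read `model.config.audio_encoder.frame_rate` for the real value. `tokens_for_duration` is a hypothetical helper.

```python
def tokens_for_duration(seconds, frame_rate, extra_tokens=3):
    """Tokens needed for a clip: frames per second times duration, plus a fixed offset."""
    return int(seconds * frame_rate) + extra_tokens


ASSUMED_FRAME_RATE = 50  # assumption; check model.config.audio_encoder.frame_rate

for seconds in (4, 10, 30):
    print(seconds, "s ->", tokens_for_duration(seconds, ASSUMED_FRAME_RATE), "tokens")
```

Because generation time scales with the token count, longer clips take proportionally longer on a CPU runtime.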


In [None]:
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=text_prompt,
    return_tensors="pt",
)

audio_values = model.generate(**inputs, do_sample=True,
                              guidance_scale=3,
                              max_new_tokens=n_tokens)

Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

# Application 2: Musical Transcription


## Generating Sheet Music with Songscription AI

For this demo, you can start from the simple chord progression generated in the MIDI section above, or use any single-instrument audio file instead.

Once your audio file is ready, go to [https://www.songscription.ai/](https://www.songscription.ai/)

# Application 3: Noise Reduction and Identification


This section uses SpeechBrain, an open-source, all-in-one conversational AI toolkit based on PyTorch.

Code source: https://colab.research.google.com/github/speechbrain/speechbrain/blob/develop/docs/tutorials/basics/what-can-i-do-with-speechbrain.ipynb#scrollTo=PuVNyffAhVfx

In [None]:
%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH


In [None]:
import speechbrain as sb
from speechbrain.dataio.dataio import read_audio
from IPython.display import Audio
from speechbrain.inference.separation import SepformerSeparation as separator
import torchaudio

In [None]:
!wget -O example_whamr.wav "https://www.dropbox.com/scl/fi/gxbtbf3c3hxr0y9dbf0nw/example_whamr.wav?rlkey=1wt5d49kjl36h0zypwrmsy8nz&dl=1"
!wget -O voice_sample.mp3 "https://raw.githubusercontent.com/ua-datalab/NLP-Speech/main/AI_applications_for_Audio/voice_sample.mp3"
!wget -O voice_sample_noisy.mp3 "https://raw.githubusercontent.com/ua-datalab/NLP-Speech/main/AI_applications_for_Audio/voice_sample_noisy.mp3"


In [None]:
model = separator.from_hparams(source="speechbrain/sepformer-whamr-enhancement",
                               savedir='pretrained_models/sepformer-whamr-enhancement4')

enhanced_speech = model.separate_file(path='/content/example_whamr.wav')
# enhanced_speech2 = model.separate_file(path='/content/voice_sample_noisy.mp3')

In [None]:
signal = read_audio("/content/example_whamr.wav").squeeze()
Audio(signal, rate=8000)

In [None]:
Audio(enhanced_speech[:, :].detach().cpu().squeeze(), rate=8000)

In [None]:
# signal = read_audio("/content/voice_sample_noisy.mp3").squeeze()
# Audio(signal, rate=8000)

In [None]:
# Audio(enhanced_speech2[:, :].detach().cpu().squeeze(), rate=8000)

# Application 4: Voice Cloning

ToDo: add Coqui models

In [None]:
!pip install TTS
!wget -O sample.m4a https://github.com/ua-datalab/Generative-AI/raw/refs/heads/enoriega/langchain/Notebooks/sample.m4a

In [None]:
# Auto-accept the Coqui model license prompt.
# You can comment out this line and type 'y' manually when prompted.
import os
os.environ["COQUI_TOS_AGREED"] = "1"

from TTS.api import TTS

# Loading a multilingual model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
# tts = TTS("tts_models/multilingual/multi-dataset/your_tts")

In [None]:
sample_audio = "/content/sample.m4a"
sample_audio2 = "/content/voice_sample.mp3"

tts.tts_to_file("This is voice cloning.",
                speaker_wav=sample_audio,
                language="en",
                file_path="output.wav")



In [None]:
display(Audio('output.wav', autoplay=True))

# Application 5: NotebookLM

ToDo: add description

Link: [https://notebooklm.google.com/notebook/15b4c6cc-4513-4e8b-bff5-8b630786396d?authuser=1&pli=1](https://notebooklm.google.com/notebook/15b4c6cc-4513-4e8b-bff5-8b630786396d?authuser=1&pli=1)

# References and resources:

- https://vibertthio.com/sornting/
- https://github.com/Curated-Awesome-Lists/awesome-ai-music-generation?tab=readme-ov-file
- https://ostechnix.com/20-ffmpeg-commands-beginners/
- https://umatechnology.org/how-to-use-ffmpeg-commands-for-audio-and-video-processing-on-linux/