***This code is GPU-enabled.***

# Installing libraries

*For installing PyTorch with GPU, it will depend on the CUDA version installed, refer to https://pytorch.org/get-started/locally/. Also refer here for a CPU one.*

When using magic commands, it's `%` for VS Code but `!` for local Jupyter notebook/Google Colab. It's better to use these commands to avoid dependency conflicts.

In [1]:
# # Installing the required packages
# %pip install numpy librosa soundfile transformers tf-keras

In [2]:
## Install ffmpeg
# %pip install python-ffmpeg

In [3]:
# # Install torch (CPU)
# %pip install torch torchvision torchaudio

*The cell above is for CPU, for GPU usage, refer to this link: https://www.youtube.com/watch?v=NrJz3ACosJA&ab_channel=LearnwithZORO*

*Tested on **Windows**, not sure if GPU-utilization would work on Mac, but best to opt for CPU in the meantime.*

In [4]:
# # Installing deepfilternet
# %pip install deepfilterlib
# %pip install deepfilternet

*For `deepfilternet`, you need to install **Visual Studio** (as well as additional Visual Studio Packages, not just Visual Studio Code). You also need to install libraries such as `deepfilterlib` and `ffmpeg-python`, as they are all dependencies of `deepfilternet` and may not be included in `pip` installation.*

In [5]:
import torch, os
print(torch.cuda.is_available())  # Should return True
print(torch.cuda.get_device_name(0))  # GPU model

True
NVIDIA GeForce RTX 3050 Laptop GPU


If the first line indicates `True`, it means `torch` detected a GPU that it will use, and the second line indicates the specific graphics card.

In [6]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Pipeline

The cell below imports necessary libraries. This includes those needed for audio-related Python tasks (`loguru`, `librosa`, `soundfile`, `ffmpeg`, `deepfilternet`/`df`), and `transformers` from Huggingface (for models).

In [9]:
%pip install ffmpeg-python python-ffmpeg

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.comNote: you may need to restart the kernel to use updated packages.

Collecting ffmpeg-python
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Collecting python-ffmpeg
  Downloading python_ffmpeg-2.0.12-py3-none-any.whl.metadata (3.2 kB)
Collecting future (from ffmpeg-python)
  Downloading future-1.0.0-py3-none-any.whl.metadata (4.0 kB)
Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Downloading python_ffmpeg-2.0.12-py3-none-any.whl (14 kB)
Downloading future-1.0.0-py3-none-any.whl (491 kB)
Installing collected packages: future, python-ffmpeg, ffmpeg-python
Successfully installed ffmpeg-python-0.2.0 future-1.0.0 python-ffmpeg-2.0.12


In [10]:
import warnings, librosa, soundfile as sf, gc, glob, pandas as pd, ffmpeg, numpy as np
from df import enhance, init_df
from transformers import pipeline, AutoModelForAudioClassification, AutoConfig

warnings.filterwarnings("ignore")

  from torchaudio.backend.common import AudioMetaData
  from .autonotebook import tqdm as notebook_tqdm


## Multiple audios

In [11]:
def enhance_audio_files(input_folder, output_folder):
    """
    Enhances audio files in the input_folder and saves them in output_folder.
    
    Parameters:
    - input_folder (str): Directory containing the original audio files.
    - output_folder (str): Directory where enhanced audio files will be saved.
    """
    os.makedirs(output_folder, exist_ok=True)

    # Initialize the model and its state once outside the loop
    model, df_state, _ = init_df()  # Load default model

    # Process each audio file in the input folder
    for file_path in glob.glob(os.path.join(input_folder, '*')):
        try:
            print(f"Processing {file_path} ...")
            y, sr = librosa.load(file_path, sr=None)  # Load audio file

            # Convert to PyTorch tensor
            y_tensor = torch.from_numpy(y).float().unsqueeze(0)

            # Enhance the audio
            enhanced_audio = enhance(model, df_state, y_tensor)

            # Convert back to NumPy array
            if isinstance(enhanced_audio, torch.Tensor):
                enhanced_audio_np = enhanced_audio.cpu().detach().numpy()
            else:
                enhanced_audio_np = enhanced_audio

            # Remove extra batch dimension if present
            if enhanced_audio_np.ndim > 1 and enhanced_audio_np.shape[0] == 1:
                enhanced_audio_np = enhanced_audio_np[0]

            # Prepare output file name
            base_name = os.path.splitext(os.path.basename(file_path))[0]
            output_file = os.path.join(output_folder, base_name + '.wav')

            # Save the enhanced audio
            sf.write(output_file, enhanced_audio_np, sr)
            print(f"Enhanced audio saved to {output_file}\n")

        except Exception as e:
            print(f"Error processing {file_path}: {e}")

In [12]:
# Execute
enhance_audio_files(input_folder="../data/With Backgorund Noise/Cleared", output_folder="../data/Enhanced/With Background Noise")

[32m2025-02-23 20:06:21[0m | [1mINFO    [0m | [36mDF[0m | [1mRunning on torch 2.6.0+cu126[0m
[32m2025-02-23 20:06:21[0m | [1mINFO    [0m | [36mDF[0m | [1mRunning on host LAPTOP-5IMR3DTG[0m


[32m2025-02-23 20:06:22[0m | [1mINFO    [0m | [36mDF[0m | [1mLoading model settings of DeepFilterNet3[0m
[32m2025-02-23 20:06:22[0m | [1mINFO    [0m | [36mDF[0m | [1mUsing DeepFilterNet3 model at C:\Users\Ebo\AppData\Local\DeepFilterNet\DeepFilterNet\Cache\DeepFilterNet3[0m
[32m2025-02-23 20:06:22[0m | [1mINFO    [0m | [36mDF[0m | [1mInitializing model `deepfilternet3`[0m
[32m2025-02-23 20:06:24[0m | [1mINFO    [0m | [36mDF[0m | [1mFound checkpoint C:\Users\Ebo\AppData\Local\DeepFilterNet\DeepFilterNet\Cache\DeepFilterNet3\checkpoints\model_120.ckpt.best with epoch 120[0m
[32m2025-02-23 20:06:24[0m | [1mINFO    [0m | [36mDF[0m | [1mRunning on device cuda:0[0m
[32m2025-02-23 20:06:24[0m | [1mINFO    [0m | [36mDF[0m | [1mModel loaded[0m
Processing ../data/With Backgorund Noise/Cleared\https___www.tiktok.com__31milmovement_video_7356107475910659346.mp3 ...
Enhanced audio saved to ../data/Enhanced/With Background Noise\https___www.tiktok.com_

In [13]:
enhance_audio_files(input_folder="../data/Mixed Interviewer and Speaker", output_folder="../data/Enhanced/Mixed Interviewer and Speaker")

[32m2025-02-23 20:08:53[0m | [1mINFO    [0m | [36mDF[0m | [1mLoading model settings of DeepFilterNet3[0m
[32m2025-02-23 20:08:53[0m | [1mINFO    [0m | [36mDF[0m | [1mUsing DeepFilterNet3 model at C:\Users\Ebo\AppData\Local\DeepFilterNet\DeepFilterNet\Cache\DeepFilterNet3[0m
[32m2025-02-23 20:08:53[0m | [1mINFO    [0m | [36mDF[0m | [1mInitializing model `deepfilternet3`[0m
[32m2025-02-23 20:08:53[0m | [1mINFO    [0m | [36mDF[0m | [1mFound checkpoint C:\Users\Ebo\AppData\Local\DeepFilterNet\DeepFilterNet\Cache\DeepFilterNet3\checkpoints\model_120.ckpt.best with epoch 120[0m
[32m2025-02-23 20:08:54[0m | [1mINFO    [0m | [36mDF[0m | [1mRunning on device cuda:0[0m
[32m2025-02-23 20:08:54[0m | [1mINFO    [0m | [36mDF[0m | [1mModel loaded[0m
Processing ../data/Mixed Interviewer and Speaker\https___www.tiktok.com__asiatoday111_video_7344992449472761094.mp3 ...
Enhanced audio saved to ../data/Enhanced/Mixed Interviewer and Speaker\https___www.tikt

# Models (emotions)

## Downloading

In [14]:
models = [
    "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition",
    "firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3",
    "firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53"
]

# Loop through each model and print its emotion classes
model_data = []
for model_name in models:
    try:
        config = AutoConfig.from_pretrained(model_name)
        emotions = list(config.id2label.values())  # Extract emotion classes
        model_data.append({"Model": model_name, "Emotions": emotions})
    except Exception as e:
        print(f"Error loading model {model_name}: {e}")
        model_data.append({"Model": model_name, "Emotions": "Error loading emotions"})

# Create DataFrame
df_models = pd.DataFrame(model_data)

# Display the expanded DataFrame
display(df_models)

Unnamed: 0,Model,Emotions
0,ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-...,"[angry, calm, disgust, fearful, happy, neutral..."
1,firdhokk/speech-emotion-recognition-with-opena...,"[angry, disgust, fearful, happy, neutral, sad,..."
2,firdhokk/speech-emotion-recognition-with-faceb...,"[angry, disgust, fearful, happy, neutral, sad,..."


## Running/evaluating the model

### GPU Usage

The code below is dedicated for **GPU usage**, including a function that will automatically clear caches after every audio processed to avoid GPU running out of memory.

In [15]:
audio_folder = r'../data/Enhanced/Mixed Interviewer and Speaker'
audio_files = [f for f in os.listdir(audio_folder) if f.endswith(".wav")]

device = 0
print(f"Device: {'GPU' if device == 0 else 'CPU'}")

all_data = []

for i, model_name in enumerate(models):
    print(f"Processing with model {i + 1}: {model_name}")
    try:
        emotion_pipeline = pipeline("audio-classification", model=model_name, device=device)

        for audio_file in audio_files:
            audio_path = os.path.join(audio_folder, audio_file)
            try:
                results = emotion_pipeline(audio_path)
                for result in results:
                    all_data.append({
                        "File": audio_file,
                        "Model": model_name,
                        "Emotion": result["label"],
                        "Score": result["score"]
                    })
            except RuntimeError as e:
                if "CUDA out of memory" in str(e):
                    print("Out of memory! Switching to CPU.")
                    emotion_pipeline = pipeline("audio-classification", model=model_name, device=-1)
                    results = emotion_pipeline(audio_path)
                    for result in results:
                        all_data.append({
                            "File": audio_file,
                            "Model": model_name,
                            "Emotion": result["label"],
                            "Score": result["score"]
                        })

    except Exception as e:
        print(f"Error with model {model_name}: {e}")

    # Cleanup
    del emotion_pipeline
    torch.cuda.empty_cache()
    gc.collect()

df = pd.DataFrame(all_data)

Device: GPU
Processing with model 1: ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition



Some weights of the model checkpoint at ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition were not used when initializing Wav2Vec2ForSequenceClassification: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.output.bias', 'classifier.output.weight']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition and are newly initialized: ['classifier.bias', 'classifier.weight', '

Processing with model 2: firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3
Processing with model 3: firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53


In [16]:
df

Unnamed: 0,File,Model,Emotion,Score
0,https___www.tiktok.com__asiatoday111_video_734...,ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-...,disgust,0.134748
1,https___www.tiktok.com__asiatoday111_video_734...,ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-...,calm,0.129843
2,https___www.tiktok.com__asiatoday111_video_734...,ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-...,fearful,0.127523
3,https___www.tiktok.com__asiatoday111_video_734...,ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-...,surprised,0.125857
4,https___www.tiktok.com__asiatoday111_video_734...,ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-...,angry,0.124464
...,...,...,...,...
115,https___www.tiktok.com__politiko_ph_video_7356...,firdhokk/speech-emotion-recognition-with-faceb...,happy,0.999698
116,https___www.tiktok.com__politiko_ph_video_7356...,firdhokk/speech-emotion-recognition-with-faceb...,angry,0.000165
117,https___www.tiktok.com__politiko_ph_video_7356...,firdhokk/speech-emotion-recognition-with-faceb...,surprised,0.000075
118,https___www.tiktok.com__politiko_ph_video_7356...,firdhokk/speech-emotion-recognition-with-faceb...,disgust,0.000035


Do the same model for applying into raw audios to see if there is a difference in emotion detection.