***This code is GPU-enabled.***

Python version used: **3.12.3**

# Installing libraries

*Installing PyTorch with GPU support depends on the CUDA version installed; see https://pytorch.org/get-started/locally/ for the matching install command. The same page also covers the CPU-only build.*

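Install packages with notebook commands rather than from a separate terminal. `%pip` is a magic command that installs into the environment of the running kernel, which helps avoid dependency conflicts; `!pip` runs `pip` as a shell command instead. The cells below use `%pip` (as used in VS Code); in a local Jupyter notebook or Google Colab, `!pip` is the common alternative.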

In [1]:
# # Installing the required packages
# %pip install numpy librosa soundfile transformers tf-keras

In [2]:
## Install ffmpeg
# %pip install python-ffmpeg

In [3]:
# # Install torch (CPU)
# %pip install torch torchvision torchaudio

*The cell above installs the CPU build. For GPU usage, refer to this video: https://www.youtube.com/watch?v=NrJz3ACosJA&ab_channel=LearnwithZORO*

*Tested on **Windows**. GPU utilization has not been verified on Mac, so it is best to opt for CPU there in the meantime.*

In [4]:
# # Installing deepfilternet
# %pip install deepfilterlib
# %pip install deepfilternet

*`deepfilternet` requires **Visual Studio** together with additional Visual Studio packages (Visual Studio Code alone is not enough). You also need to install `deepfilterlib` and `ffmpeg-python` yourself: both are dependencies of `deepfilternet` but may not be pulled in by the `pip` installation.*

In [5]:
import torch, os
print(torch.cuda.is_available())  # True when PyTorch can see a CUDA GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # GPU model (skipped on CPU-only machines)

True
NVIDIA GeForce RTX 3050 Laptop GPU


If the first line prints `True`, `torch` has detected a GPU that it will use; the second line names the specific graphics card.
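
The same check can drive the choice of device instead of only being printed. The sketch below is a minimal example, assuming the `0` = GPU / `-1` = CPU convention that `process_audio_with_model` documents for its `device` argument; the helper name `pick_device` is hypothetical and not part of the original notebook.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when PyTorch can use one, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("Using", "GPU" if device == 0 else "CPU")
```

The returned value could then be passed as `device=device` so the notebook also runs on machines without a CUDA GPU.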

In [6]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Pipeline

The cell below imports the necessary libraries: those needed for audio-related tasks (`loguru`, `librosa`, `soundfile`, `ffmpeg`, `deepfilternet`/`df`) and `transformers` from Hugging Face (for the models).

In [7]:
# %pip install ffmpeg-python python-ffmpeg

In [8]:
import warnings, librosa, soundfile as sf, gc, glob, pandas as pd, ffmpeg, numpy as np
from df import enhance, init_df
from transformers import pipeline, AutoModelForAudioClassification, AutoConfig

warnings.filterwarnings("ignore")



## Multiple audios

In [9]:
def enhance_audio_files(input_folder, output_folder):
    """
    Enhances audio files in the input_folder and saves them in output_folder.
    
    Parameters:
    - input_folder (str): Directory containing the original audio files.
    - output_folder (str): Directory where enhanced audio files will be saved.
    """
    os.makedirs(output_folder, exist_ok=True)

    # Initialize the model and its state once outside the loop
    model, df_state, _ = init_df()  # Load default model

    # Process each audio file in the input folder
    for file_path in glob.glob(os.path.join(input_folder, '*')):
        try:
            print(f"Processing {file_path} ...")
            # Load and resample to the rate the DeepFilterNet model expects
            y, sr = librosa.load(file_path, sr=df_state.sr())

            # Convert to PyTorch tensor
            y_tensor = torch.from_numpy(y).float().unsqueeze(0)

            # Enhance the audio
            enhanced_audio = enhance(model, df_state, y_tensor)

            # Convert back to NumPy array
            if isinstance(enhanced_audio, torch.Tensor):
                enhanced_audio_np = enhanced_audio.cpu().detach().numpy()
            else:
                enhanced_audio_np = enhanced_audio

            # Remove extra batch dimension if present
            if enhanced_audio_np.ndim > 1 and enhanced_audio_np.shape[0] == 1:
                enhanced_audio_np = enhanced_audio_np[0]

            # Prepare output file name
            base_name = os.path.splitext(os.path.basename(file_path))[0]
            output_file = os.path.join(output_folder, base_name + '.wav')

            # Save the enhanced audio
            sf.write(output_file, enhanced_audio_np, sr)
            print(f"Enhanced audio saved to {output_file}\n")

        except Exception as e:
            print(f"Error processing {file_path}: {e}")

In [10]:
def convert_audio_files_to_wav(input_folder, output_folder):
    """
    Converts audio files in the input_folder to WAV format and saves them in output_folder.
    
    Parameters:
    - input_folder (str): Directory containing the original audio files.
    - output_folder (str): Directory where converted audio files will be saved.
    """
    os.makedirs(output_folder, exist_ok=True)

    # Process each audio file in the input folder
    for file_path in glob.glob(os.path.join(input_folder, '*')):
        try:
            print(f"Processing {file_path} ...")
            y, sr = librosa.load(file_path, sr=None)  # Load audio file

            # Prepare output file name
            base_name = os.path.splitext(os.path.basename(file_path))[0]
            output_file = os.path.join(output_folder, base_name + '.wav')

            # Save the audio in WAV format
            sf.write(output_file, y, sr)
            print(f"Converted audio saved to {output_file}\n")

        except Exception as e:
            print(f"Error processing {file_path}: {e}")

In [11]:
# Execute
# enhance_audio_files(input_folder="../data/With Backgorund Noise/Cleared", output_folder="../data/Enhanced/With Background Noise")

In [12]:
# enhance_audio_files(input_folder="../data/Mixed Interviewer and Speaker", output_folder="../data/Enhanced/Mixed Interviewer and Speaker")

# Models (emotions)

## Downloading

In [13]:
models = [
    "firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3",
    "firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53"
]

# Loop through each model and print its emotion classes
model_data = []
for model_name in models:
    try:
        config = AutoConfig.from_pretrained(model_name)
        emotions = list(config.id2label.values())  # Extract emotion classes
        model_data.append({"Model": model_name, "Emotions": emotions})
    except Exception as e:
        print(f"Error loading model {model_name}: {e}")
        model_data.append({"Model": model_name, "Emotions": "Error loading emotions"})

# Create DataFrame
df_models = pd.DataFrame(model_data)

# Display the expanded DataFrame
display(df_models)

| | Model | Emotions |
|---|---|---|
| 0 | firdhokk/speech-emotion-recognition-with-opena... | [angry, disgust, fearful, happy, neutral, sad,... |
| 1 | firdhokk/speech-emotion-recognition-with-faceb... | [angry, disgust, fearful, happy, neutral, sad,... |


## Running/evaluating the model

### GPU Usage

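The code below is intended for **GPU usage**. It includes a function that clears the GPU cache after each folder is processed (once the model has run on every audio file in it) to reduce the risk of the GPU running out of memory, and that falls back to CPU if an out-of-memory error still occurs.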
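
The cleanup step can be isolated into a small helper. This is a sketch only: `free_gpu_memory` is a hypothetical name, and the guarded import is an assumption added so the snippet also runs where PyTorch is not installed. Note that `torch.cuda.empty_cache()` only releases cached memory that is no longer referenced, so references to the model (for example the pipeline object) need to be dropped first.

```python
import gc


def free_gpu_memory() -> None:
    """Run Python garbage collection, then release PyTorch's unused cached GPU memory."""
    gc.collect()  # drop unreachable objects so their GPU tensors become free
    try:
        import torch
    except ImportError:
        return  # nothing to release without PyTorch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


free_gpu_memory()
```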

### Videos from Sen Risa Hontiveros

In [14]:
# enhance_audio_files(input_folder=r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\Risa Hontiveros", output_folder=r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\Risa Hontiveros\enhanced")
# convert_audio_files_to_wav(input_folder=r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\Risa Hontiveros", output_folder=r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\Risa Hontiveros\waved")

In [15]:
for folder in glob.glob(r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\*"):
    enhance_audio_files(input_folder=folder, output_folder=folder + r"\enhanced")
    convert_audio_files_to_wav(input_folder=folder, output_folder=folder + r"\waved")

2025-03-01 20:46:35 | INFO     | DF | Running on torch 2.6.0+cu126
2025-03-01 20:46:35 | INFO     | DF | Running on host LAPTOP-5IMR3DTG
2025-03-01 20:46:35 | INFO     | DF | Loading model settings of DeepFilterNet3
2025-03-01 20:46:35 | INFO     | DF | Using DeepFilterNet3 model at C:\Users\Ebo\AppData\Local\DeepFilterNet\DeepFilterNet\Cache\DeepFilterNet3
2025-03-01 20:46:35 | INFO     | DF | Initializing model `deepfilternet3`
2025-03-01 20:46:36 | INFO     | DF | Found checkpoint C:\Users\Ebo\AppData\Local\DeepFilterNet\DeepFilterNet\Cache\DeepFilterNet3\checkpoints\model_120.ckpt.best with epoch 120
2025-03-01 20:46:36 | INFO     | DF | Running on device cuda:0
2025-03-01 20:46:36 | INFO     | DF | Model loaded
Pr

The cell above creates two new subfolders inside each folder: `enhanced` (from `enhance_audio_files`) and `waved` (from `convert_audio_files_to_wav`). `enhanced` contains the audio processed with DeepFilterNet (background noise removed), while `waved` contains the original audio converted to WAV only.
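
Both functions list their inputs with `glob` and a `*` pattern, which also matches subfolders. Once `enhanced` exists inside a folder, it is therefore picked up as if it were an audio file (the `try`/`except` turns this into an error message rather than a crash). A minimal sketch of a stricter listing is shown below; `list_audio_files` and the extension set are illustrative assumptions, not part of the original notebook.

```python
import os
import tempfile

# Assumed set of extensions; adjust to the formats actually present in the data.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".mp4", ".m4a", ".flac", ".ogg"}


def list_audio_files(folder):
    """Return sorted paths of regular files in `folder` that have an audio extension."""
    paths = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        ext = os.path.splitext(name)[1].lower()
        if os.path.isfile(path) and ext in AUDIO_EXTENSIONS:
            paths.append(path)
    return paths


# Demonstration on a throwaway directory containing a file, a non-audio file and a subfolder
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "clip.mp3"), "w").close()
    open(os.path.join(tmp, "notes.txt"), "w").close()
    os.makedirs(os.path.join(tmp, "enhanced"))
    found = [os.path.basename(p) for p in list_audio_files(tmp)]
    print(found)  # ['clip.mp3']
```

Iterating over `list_audio_files(input_folder)` instead of the `glob` result would skip the `enhanced` and `waved` subfolders on repeated runs.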

In [16]:
def process_audio_with_model(audio_folder, model_name, device=0):
    """
    Process audio files in the specified folder with the given model and return the results as a DataFrame.
    
    Parameters:
    - audio_folder (str): Directory containing the audio files.
    - model_name (str): Name of the model to use for emotion recognition.
    - device (int): Device to use for processing (0 for GPU, -1 for CPU).
    
    Returns:
    - pd.DataFrame: DataFrame containing the results.
    """
    audio_files = [f for f in os.listdir(audio_folder) if f.endswith(".wav")]
    print(f"Device: {'GPU' if device == 0 else 'CPU'}")

    all_data = []
    emotion_pipeline = None  # Defined up front so the cleanup works even if loading fails

    print(f"Processing with model: {model_name}")
    try:
        emotion_pipeline = pipeline("audio-classification", model=model_name, device=device)

        for audio_file in audio_files:
            audio_path = os.path.join(audio_folder, audio_file)
            try:
                results = emotion_pipeline(audio_path)
            except RuntimeError as e:
                if "out of memory" not in str(e).lower():
                    # Report other runtime errors instead of silently skipping the file
                    print(f"Error processing {audio_file}: {e}")
                    continue
                # GPU ran out of memory: rebuild the pipeline on CPU and retry this file
                print("Out of memory! Switching to CPU.")
                emotion_pipeline = None
                torch.cuda.empty_cache()
                emotion_pipeline = pipeline("audio-classification", model=model_name, device=-1)
                results = emotion_pipeline(audio_path)

            # One row per file, one column per returned emotion label
            result_dict = {"File": audio_file, "Model": model_name}
            for result in results:
                result_dict[result["label"]] = result["score"]
            all_data.append(result_dict)

    except Exception as e:
        print(f"Error with model {model_name}: {e}")

    # Cleanup: drop the model reference, then release cached GPU memory
    emotion_pipeline = None
    gc.collect()
    torch.cuda.empty_cache()

    return pd.DataFrame(all_data)

The code below processes all audio under the `enhanced` subfolder of each folder within `folis` (folder name in the path) and concatenates the results into a single DataFrame for the model (one DataFrame = one model).

In [19]:
# Do for all folders
all_results_model1 = []

for folder in glob.glob(r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\*"):
    df_model1 = process_audio_with_model(
        audio_folder=folder + r"\enhanced",
        model_name="firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3"
    )
    all_results_model1.append(df_model1)

# Concatenate all results into one DataFrame
df_all_model1 = pd.concat(all_results_model1, ignore_index=True)

# Save the concatenated DataFrame to a CSV file
df_all_model1.to_csv(r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\results\all_results_model1.csv", index=False)

Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3
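
Each row of the saved results is built from a list of `{"label": ..., "score": ...}` dictionaries, as the loop in `process_audio_with_model` shows. When a label is not among those returned for a file, its column is simply missing for that row, which is consistent with the blank cells in the result tables of this notebook. The sketch below shows one way to give every row the same columns and to record the highest-scoring emotion; `predictions_to_row`, the label subset, and the scores are made up for illustration.

```python
def predictions_to_row(file_name, model_name, predictions, all_labels):
    """Flatten one file's predictions into a dict with a column for every label."""
    row = {"File": file_name, "Model": model_name}
    scores = {p["label"]: p["score"] for p in predictions}
    for label in all_labels:
        row[label] = scores.get(label)  # None when the label was not returned
    row["Top emotion"] = max(scores, key=scores.get) if scores else None
    return row


labels = ["angry", "happy", "sad"]  # illustrative subset, not the models' full label list
preds = [{"label": "happy", "score": 0.9}, {"label": "angry", "score": 0.1}]  # made-up scores
row = predictions_to_row("clip.wav", "demo-model", preds, labels)
print(row)
```

A list of such rows can be passed to `pd.DataFrame` in the same way the notebook does with `all_data`.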


The code below processes all audio under the `waved` subfolder (raw audio, converted to WAV only) of each folder within `folis` (folder name in the path) and concatenates the results into a single DataFrame for the model (one DataFrame = one model).

In [None]:
# Do for all folders
all_results_model2 = []

for folder in glob.glob(r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\*"):
    df_model2 = process_audio_with_model(
        audio_folder=folder + r"\waved",
        model_name="firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53"
    )
    all_results_model2.append(df_model2)

# Concatenate all results into one DataFrame
df_all_model2 = pd.concat(all_results_model2, ignore_index=True)

# Save the concatenated DataFrame to a CSV file
df_all_model2.to_csv(r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\results\all_results_model2.csv", index=False)

Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53

Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53


## Sample (if single folder)

In [None]:
# # Process audio files with the first model
# df_model1_1 = process_audio_with_model(
#     audio_folder=r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\Risa Hontiveros\enhanced",
#     model_name="firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3"
# )

# # Process audio files with the second model
# df_model2_1 = process_audio_with_model(
#     audio_folder=r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\Risa Hontiveros\enhanced",
#     model_name="firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53"
# )

Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53



| | File | Model | angry | happy | neutral | fearful | surprised |
|---|---|---|---|---|---|---|---|
| 0 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-opena... | 0.914166 | 0.07214 | 0.005228 | 0.004229 | 0.002429 |
| 1 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-opena... | 0.226042 | 0.744771 | 0.00506 | 0.009099 | 0.013111 |
| 2 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-opena... | 0.962185 | 0.015838 | 0.007485 | 0.002069 | 0.010302 |
| 3 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-opena... | 0.124859 | 0.074972 | 0.019865 | 0.006124 | 0.770732 |


| | File | Model | happy | surprised | angry | fearful | disgust | sad |
|---|---|---|---|---|---|---|---|---|
| 0 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-faceb... | 0.989997 | 0.006588 | 0.003113 | 0.000267 | 2.6e-05 | NaN |
| 1 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-faceb... | 0.991388 | 0.001133 | 0.00737 | 8.4e-05 | 2e-05 | NaN |
| 2 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-faceb... | 0.962523 | 0.005589 | 0.006437 | NaN | 0.025076 | 0.000286 |
| 3 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-faceb... | 0.996886 | 0.000393 | 0.002306 | 0.000263 | 0.00012 | NaN |


In [None]:
# display(df_model1_1)
# display(df_model2_1)


In [None]:
# # Process audio files with the first model
# df_model1_wav = process_audio_with_model(
#     audio_folder=r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\Risa Hontiveros\waved",
#     model_name="firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3"
# )

# # # Process audio files with the second model
# # df_model2_wav = process_audio_with_model(
# #     audio_folder=r"C:\Users\Ebo\Projetos Personais\Rappler\Audio-Analysis\data\NEWNEW\clean\folis\Risa Hontiveros\waved",
# #     model_name="firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53"
# # )

# # # Display the DataFrames
# # display(df_model1_wav)
# # display(df_model2_wav)

Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3
Device: GPU
Processing with model: firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53


| | File | Model | angry | happy | neutral | fearful | surprised |
|---|---|---|---|---|---|---|---|
| 0 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-opena... | 0.927768 | 0.060046 | 0.004754 | 0.003611 | 0.002088 |
| 1 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-opena... | 0.555325 | 0.422694 | 0.005575 | 0.007901 | 0.006355 |
| 2 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-opena... | 0.977474 | 0.013198 | 0.003338 | 0.001563 | 0.002966 |
| 3 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-opena... | 0.254214 | 0.28349 | 0.031496 | 0.017599 | 0.40641 |


| | File | Model | happy | surprised | angry | fearful | disgust | sad |
|---|---|---|---|---|---|---|---|---|
| 0 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-faceb... | 0.983964 | 0.012395 | 0.003276 | 0.000322 | 3.2e-05 | NaN |
| 1 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-faceb... | 0.998551 | 0.000735 | 0.000658 | 4.5e-05 | 9e-06 | NaN |
| 2 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-faceb... | 0.997095 | 0.001171 | 0.000554 | NaN | 0.001092 | 6.5e-05 |
| 3 | https___www.tiktok.com__senrisahontiveros_vide... | firdhokk/speech-emotion-recognition-with-faceb... | 0.999091 | 0.00014 | 0.000659 | 3.3e-05 | 6.2e-05 | NaN |
