# Deep Learning for Computer Vision: Final Project

## Computer Science: COMS W 4995 011

### Proposal: Due November 7, 2023
### Presentations: Due December 5 and 7, 2023
### Final Report: Due December 8, 2023

### Project Overview

The final project is one of the most important and, hopefully, exciting components of the course. You will have the opportunity to develop a deep learning system of your own choosing. 
You are free to select whatever framework (PyTorch, TensorFlow, etc.) you like, but you need to create a report on your project in a Jupyter notebook. You are also free to build on publicly available models and code, but your report must clearly give attribution for the work of others and must clearly delineate your contributions. Also, half of the class will present their projects during the last 2 days. All of the class will prepare videos of their presentations and submit these when the final report is due.

### Project Proposal

The project description should include the title of the project, participants, a description of the objectives of the project, and a plan for how the project will be completed. The description of the objectives should include modest predictions of the success of the project. The plan for completion should include a description of the training data and how it will be obtained, a discussion of what deep learning framework will be used and why, and a rough description of the planned network architecture.

You are permitted to work together on a project in groups of two or three, but group size must not exceed three participants.  For group projects there must be a clearly delineated division of labor: you should state in the project description and project report who was responsible for which portion of the project. Each student must hand in a separate report. (Students will not necessarily get the same grade for the same project.)

You should mention whether you are simply re-implementing what others have done before but applying it to new data, or whether you are attempting to do something that is, to the best of your knowledge, new. Creative and original projects will be judged more kindly than those that are rehashing something in the existing literature. And projects that include a component in which data is acquired/curated into training and validation sets will be viewed more favorably than those that simply download an existing data set such as CIFAR-100.

As this is a computer vision course it is expected that your data will be visual, but exceptions might be made if the student is enthusiastic and persuasive enough. The most straightforward project would be to build a system that classifies images into categories. A more difficult project might be to build a system that detects and localizes a type of object within an image. A still more complicated project might involve joining a ConvNet/Vision Transformer with an LSTM/Transformer for a problem (like image captioning) that requires vision and language. But again, creative and original projects will be judged more kindly.  

It is important to scope your project so that you get some working results. Project reports that say "I tried this and this but nothing seemed to work..." are discouraged. Above all, you should demonstrate end-to-end fluency in the basics of deep learning. 

I cannot wait to see the results. Good luck!

##### Participants
Jordyn Kim (jk4671), Sun Kim (syk2145)

##### Project Objectives

The objective of this project is to develop a deep learning model to separate single-channel cinematic audio into three distinct components: voice, background music, and sound effects. This task, often referred to as the cocktail fork separation problem, is highly applicable in the entertainment industry and a prominent focus within the field of audio signal processing.

Based on our preliminary research, we expect voice separation to yield promising results, with clear and high-quality separation. Achieving precise separation of music and effects may, however, face some constraints due to (1) the computational overhead of our chosen architecture (RNN-based models), (2) limited available computational resources, and (3) additional time required for preprocessing audio signals into spectrograms.

##### Project Plan

The project will be completed in three main phases:
1.	Initial Setup and Data Preparation (Phase 1)
	- This phase includes setting up the necessary infrastructure: establishing a Git repository, importing the Divide and Remaster (DnR) dataset, and converting audio data into spectrograms or mel-spectrograms using the short-time Fourier transform (STFT).
2.	Voice vs. Non-Voice Separation (Phase 2)
	-	The focus of this phase is to isolate voice from background music and sound effects. We plan to test two different approaches:
	-	Implementing Band-Split RNN (BSRNN): Following the paper https://arxiv.org/abs/2209.15174, this model is trained with the Adam optimizer on a loss function that combines frequency-domain and time-domain L1 losses.
	-	Fine-Tuning Pre-trained Models: We also aim to experiment with fine-tuning models pre-trained for music source separation, such as Open-Unmix and Demucs. As noted in the paper https://arxiv.org/pdf/2308.06981, models trained on music-only data tend to perform better in separating voice from non-voice. We can leverage publicly available checkpoints trained on the MUSDB18 music dataset.
3.	Music vs. Sound Effects Differentiation (Phase 3)
	-	In this final phase, we will build on the voice separation results by extending the band-split RNN model to further differentiate background music from sound effects. Similar network structures will be applied, and additional features or model variations may be introduced based on the results from Phase 2.
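
The spectrogram conversion from Phase 1 and the combined loss mentioned in Phase 2 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the project's training code: the function names `stft` and `combined_l1`, the window and hop sizes (`n_fft=2048`, `hop=512`), and the equal weighting of the two loss terms are all choices made for the example. A trainable version would need a differentiable STFT in PyTorch rather than NumPy.

```python
import numpy as np

def stft(audio, n_fft=2048, hop=512):
    """Complex STFT of a mono signal, shape [n_fft // 2 + 1, frames]."""
    window = np.hanning(n_fft)
    if len(audio) < n_fft:  # pad short clips so that one full frame fits
        audio = np.pad(audio, (0, n_fft - len(audio)))
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # FFT of every windowed frame, transposed to [freq_bins, frames]
    return np.fft.rfft(frames, axis=1).T

def combined_l1(estimate, target):
    """Time-domain L1 plus L1 on the real and imaginary STFT parts."""
    time_term = np.mean(np.abs(estimate - target))
    diff = stft(estimate) - stft(target)
    freq_term = np.mean(np.abs(diff.real)) + np.mean(np.abs(diff.imag))
    return time_term + freq_term

sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)          # 1 s test tone at 1 kHz
spec = np.abs(stft(tone))                      # magnitude spectrogram
peak_hz = int(spec.mean(axis=1).argmax()) * sr / 2048
loss_same = combined_l1(tone, tone)            # identical signals
loss_diff = combined_l1(0.5 * tone, tone)      # attenuated estimate
```

The test tone is only there to check that the frequency axis is laid out as expected: the strongest bin should sit near 1 kHz, and the loss should vanish when the estimate equals the target.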

##### Training Data

We will use the Divide and Remaster (DnR) dataset https://zenodo.org/records/6949108, approximately 200GB in size, which is specifically structured for the cocktail fork separation problem. This dataset includes labeled annotations for music, sound effects, and speech to support structured training.

##### Deep Learning Framework

We plan to use PyTorch for model implementation due to its compatibility with pre-trained models like Demucs and Open-Unmix, as well as its flexibility for custom network structures.

##### Network Architecture

Our network architecture will primarily focus on the Band-Split RNN (BSRNN) model. The initial step in the pipeline involves splitting spectrogram data into sub-bands by frequency, where each band is then normalized and processed through a fully connected dense layer. The second step introduces two RNNs: one handling the frequency domain and the other the time domain using bidirectional LSTMs. The outputs from all bands are then merged to form the final representation.
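
The band-split step described above can be sketched roughly as below. This is an illustration of the data flow only: the uniform band widths, the feature size of 16, the simple per-frame normalization, and the randomly initialized dense weights are assumptions made for the example, and the function name `band_split` is hypothetical. In an actual model the normalization and dense layers would be learned modules.

```python
import numpy as np

rng = np.random.default_rng(0)

def band_split(spec, band_widths, feature_dim=16):
    """Split a complex spectrogram [freq, frames] into sub-band features.

    Returns an array of shape [num_bands, frames, feature_dim].
    """
    assert sum(band_widths) == spec.shape[0], "bands must cover all bins"
    outputs, start = [], 0
    for width in band_widths:
        band = spec[start : start + width]                  # [width, frames]
        # Use real and imaginary parts as features: [frames, 2 * width]
        feats = np.concatenate([band.real, band.imag], axis=0).T
        # Normalize each frame (stand-in for the per-band normalization layer)
        feats = (feats - feats.mean(axis=1, keepdims=True)) / (
            feats.std(axis=1, keepdims=True) + 1e-8
        )
        # Stand-in for the per-band fully connected layer (weights not learned here)
        weight = rng.standard_normal((2 * width, feature_dim)) / np.sqrt(2 * width)
        outputs.append(feats @ weight)
        start += width
    return np.stack(outputs)

# Placeholder complex spectrogram with 1025 frequency bins and 10 frames
spec = rng.standard_normal((1025, 10)) + 1j * rng.standard_normal((1025, 10))
widths = [25] * 41            # hypothetical uniform split: 41 bands of 25 bins
features = band_split(spec, widths)
```

The `[num_bands, frames, feature_dim]` layout is what makes the second step possible: one recurrent layer can run along the frame axis and another along the band axis.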

In case the BSRNN implementation proves too complex, we will experiment with alternative pre-trained models, such as Open-Unmix and Demucs, to address the separation task.

##### Team Responsibilities

-	Data Collection and Preprocessing	
-	Model Implementation
-	Fine-Tuning and Integration of Pre-Trained Models	
-	Evaluation and Optimization

##### Evaluation

We will evaluate model performance using the Signal-to-Distortion Ratio (SDR) metric, a standard measure for assessing audio source separation quality.
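
For reference, the basic energy-ratio form of SDR for one source, with reference signal $s$ and estimate $\hat{s}$ indexed by channel $c$ and sample $t$, can be written as follows. This matches the `sdr` function used later in this notebook, where $\delta$ is a small constant added for numerical stability; other definitions of SDR (for example the BSS Eval variant) differ in detail.

```latex
\mathrm{SDR}(s, \hat{s}) = 10 \log_{10}
\frac{\sum_{c,t} s_{c,t}^{2} + \delta}
     {\sum_{c,t} \left( s_{c,t} - \hat{s}_{c,t} \right)^{2} + \delta}
```

Higher values are better: a value near 0 dB means the error carries about as much energy as the reference itself.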

### Project Presentations


To allow students to present their work in two class periods, each student will have only 3 minutes, not a second more. We will be strict about the timing, so you should practice your presentation. The key here is to get across three things: what you did, how you did it, and how well it worked. Students working in groups of two will get 6 minutes and groups of three will get 9 minutes. Note that only half of the class will present during the last two days of class, but all of the groups will submit videos of their projects. The time limit rules for the videos are the same as for the presentations. The video can simply be a narration over a slideshow, but again each student in a group needs to present/narrate the work they did.


### Project Reports

The report should be done as a Jupyter Notebook. The report should be a complete description of the objectives of the work, the methods used to solve the problem, experimental evidence of a working system, the code, and clear delineation of what you have done vs. what you are leveraging that others have done. If you have used the work of others YOU MUST INCLUDE ATTRIBUTION by citing this work inline and as part of a "bibliography" at the end. You should describe what worked, what did not, and why. If you are working in a group you need to submit your own report, and this report should be clear about what your individual contribution was. It is OK to include your collaborators' work in your report, but you must be clear about your section and write this yourself. This project report constitutes a large fraction of your final grade, so take it seriously and include enough material and details for us to give you a good grade. If you are having trouble imagining the structure of the report, refer to published research papers in CV as a possible model.

## Training model

In [1]:
import sys
import os

from torch.utils.data import DataLoader
import torch.nn as nn
import torch

import numpy as np
import soundfile as sf
import librosa

from demucs import pretrained
# from demucs.pretrained import get_model
from demucs.apply import apply_model

# Add the source directory to the Python path
source_path = os.path.abspath("./source")
if source_path not in sys.path:
    sys.path.append(source_path)
import importlib


In [2]:
htdemucs = pretrained.get_model('htdemucs') # load pretrained htdemucs

# Modify the network to output 3 stems instead of the pretrained 4
model = htdemucs.models[0]
model.sources = ['speech', 'music', 'sfx']
# Spectrogram branch: 3 stems * 2 audio channels * 2 (real and imaginary parts) = 12
model.decoder[-1].conv_tr = torch.nn.ConvTranspose2d(
    in_channels=48,
    out_channels=12,
    kernel_size=(8, 1),
    stride=(4, 1)
)
# Time-domain branch: 3 stems * 2 audio channels = 6
model.tdecoder[-1].conv_tr = torch.nn.ConvTranspose1d(
    in_channels=48,
    out_channels=6,
    kernel_size=8,
    stride=4
)
model

HTDemucs(
  (encoder): ModuleList(
    (0): HEncLayer(
      (conv): Conv2d(4, 48, kernel_size=(8, 1), stride=(4, 1), padding=(2, 0))
      (norm1): Identity()
      (rewrite): Conv2d(48, 96, kernel_size=(1, 1), stride=(1, 1))
      (norm2): Identity()
      (dconv): DConv(
        (layers): ModuleList(
          (0): Sequential(
            (0): Conv1d(48, 6, kernel_size=(3,), stride=(1,), padding=(1,))
            (1): GroupNorm(1, 6, eps=1e-05, affine=True)
            (2): GELU(approximate='none')
            (3): Conv1d(6, 96, kernel_size=(1,), stride=(1,))
            (4): GroupNorm(1, 96, eps=1e-05, affine=True)
            (5): GLU(dim=1)
            (6): LayerScale()
          )
          (1): Sequential(
            (0): Conv1d(48, 6, kernel_size=(3,), stride=(1,), padding=(2,), dilation=(2,))
            (1): GroupNorm(1, 6, eps=1e-05, affine=True)
            (2): GELU(approximate='none')
            (3): Conv1d(6, 96, kernel_size=(1,), stride=(1,))
            (4): GroupNo

In [3]:
import dataset
importlib.reload(dataset)

# Initialize dataset
dataset_path = "DnR/dnr_small"  # Use small dataset for local testing. Replace with the full dataset 
train_path = os.path.join(dataset_path, "tr")  # Folder containing training data
val_path = os.path.join(dataset_path, "cv")  # Folder containing validation data
test_path = os.path.join(dataset_path, "tt")  # Folder containing test data

# Initialize datasets
chunk_size = 264600  # 6 seconds at 44.1 kHz
train_dataset = dataset.DnR_Dataset(root_dir=train_path, sample_rate=44100, chunk_size=chunk_size)
val_dataset = dataset.DnR_Dataset(root_dir=val_path, sample_rate=44100, chunk_size=chunk_size)
test_dataset = dataset.DnR_Dataset(root_dir=test_path, sample_rate=44100, chunk_size=chunk_size)

# Initialize DataLoaders
batch_size = 8 # update batch size when using GPU
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
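
The `dataset` module itself is not shown in this notebook. As a rough sketch of what fixed-length chunking with `chunk_size = 264600` could look like, the hypothetical helper `crop_chunk` below takes a random excerpt and zero-pads clips that are too short. The function name, the padding behavior, and the 60-second placeholder clip are assumptions for illustration and may differ from what `DnR_Dataset` actually does.

```python
import numpy as np

def crop_chunk(audio, chunk_size, rng):
    """Return a [channels, chunk_size] excerpt, zero-padding clips that are too short."""
    n_samples = audio.shape[-1]
    if n_samples < chunk_size:
        return np.pad(audio, ((0, 0), (0, chunk_size - n_samples)))
    # Random start offset so that the excerpt fits entirely inside the clip
    start = int(rng.integers(0, n_samples - chunk_size + 1))
    return audio[:, start : start + chunk_size]

sample_rate = 44100
chunk_size = 264600
chunk_seconds = chunk_size / sample_rate       # length of one chunk in seconds
rng = np.random.default_rng(0)
clip = np.zeros((2, 60 * sample_rate), dtype=np.float32)   # placeholder stereo clip
chunk = crop_chunk(clip, chunk_size, rng)
short = crop_chunk(np.zeros((2, 1000), dtype=np.float32), chunk_size, rng)
```

For source separation, the same start offset has to be applied to the mixture and to every stem so that the targets stay aligned with the input.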

In [4]:
import train 
importlib.reload(train)

# Create output directory to save model
output_dir = "./models/local_train"
os.makedirs(output_dir, exist_ok=True)

# DEVICE = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
DEVICE = "cpu"

model.to(DEVICE)
# Train the model
trained_model = train.train_model(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    early_stopping = train.EarlyStopping(),
    epochs=100,  
    learning_rate=1e-4,  # Learning rate
    output_dir=output_dir,
    device = DEVICE
)



--------------------
Epoch 1/100


100%|██████████| 3/3 [03:49<00:00, 76.63s/it]


Training Loss: 0.00049921
Val Loss: 0.00059880
Validation loss improved to 0.00059880, model saved.
--------------------
Epoch 2/100


100%|██████████| 3/3 [04:05<00:00, 81.79s/it] 


Training Loss: 0.00044242
Val Loss: 0.00039084
Validation loss improved to 0.00039084, model saved.
--------------------
Epoch 3/100


100%|██████████| 3/3 [04:05<00:00, 81.82s/it] 


Training Loss: 0.00045285
Val Loss: 0.00045975
No improvement in validation loss. Counter: 1/10
--------------------
Epoch 4/100


100%|██████████| 3/3 [04:04<00:00, 81.61s/it] 


Training Loss: 0.00043116
Val Loss: 0.00043526
No improvement in validation loss. Counter: 2/10
--------------------
Epoch 5/100


100%|██████████| 3/3 [04:00<00:00, 80.14s/it] 


Training Loss: 0.00048843
Val Loss: 0.00048453
No improvement in validation loss. Counter: 3/10
--------------------
Epoch 6/100


100%|██████████| 3/3 [03:48<00:00, 76.27s/it] 


Training Loss: 0.00051201
Val Loss: 0.00037891
Validation loss improved to 0.00037891, model saved.
--------------------
Epoch 7/100


100%|██████████| 3/3 [03:42<00:00, 74.31s/it]


Training Loss: 0.00048181
Val Loss: 0.00044804
No improvement in validation loss. Counter: 1/10
--------------------
Epoch 8/100


100%|██████████| 3/3 [04:20<00:00, 86.75s/it] 


Training Loss: 0.00044949
Val Loss: 0.00032283
Validation loss improved to 0.00032283, model saved.
--------------------
Epoch 9/100


100%|██████████| 3/3 [04:02<00:00, 80.80s/it] 


Training Loss: 0.00050985
Val Loss: 0.00044025
No improvement in validation loss. Counter: 1/10
--------------------
Epoch 10/100


100%|██████████| 3/3 [04:11<00:00, 83.99s/it] 


Training Loss: 0.00051913
Val Loss: 0.00032449
No improvement in validation loss. Counter: 2/10
--------------------
Epoch 11/100


100%|██████████| 3/3 [03:53<00:00, 77.90s/it] 


Training Loss: 0.00058113
Val Loss: 0.00050744
No improvement in validation loss. Counter: 3/10
--------------------
Epoch 12/100


100%|██████████| 3/3 [03:51<00:00, 77.03s/it] 


Training Loss: 0.00044748
Val Loss: 0.00037881
No improvement in validation loss. Counter: 4/10
--------------------
Epoch 13/100


100%|██████████| 3/3 [03:55<00:00, 78.42s/it] 


Training Loss: 0.00048301
Val Loss: 0.00049471
No improvement in validation loss. Counter: 5/10
--------------------
Epoch 14/100


100%|██████████| 3/3 [03:54<00:00, 78.26s/it] 


Training Loss: 0.00043522
Val Loss: 0.00045775
No improvement in validation loss. Counter: 6/10
--------------------
Epoch 15/100


100%|██████████| 3/3 [04:06<00:00, 82.06s/it] 


Training Loss: 0.00052551
Val Loss: 0.00038158
No improvement in validation loss. Counter: 7/10
--------------------
Epoch 16/100


100%|██████████| 3/3 [04:05<00:00, 81.77s/it] 


Training Loss: 0.00053632
Val Loss: 0.00046285
No improvement in validation loss. Counter: 8/10
--------------------
Epoch 17/100


100%|██████████| 3/3 [04:06<00:00, 82.10s/it] 


Training Loss: 0.00042774
Val Loss: 0.00079286
No improvement in validation loss. Counter: 9/10
--------------------
Epoch 18/100


100%|██████████| 3/3 [04:04<00:00, 81.42s/it] 


Training Loss: 0.00048805
Val Loss: 0.00045458
No improvement in validation loss. Counter: 10/10
Early stopping triggered
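
The counter behavior in the log above is consistent with a simple patience rule: reset the counter whenever the validation loss reaches a new minimum, otherwise increment it, and stop once it reaches 10. The class below is a hypothetical sketch of such a rule, not the `train.EarlyStopping` implementation, which is not shown here. The name `PatienceStopper` and its `step` method are assumptions.

```python
class PatienceStopper:
    """Stop training once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss   # new best: this is where a checkpoint would be saved
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience

# Validation losses copied from the training log above
val_losses = [
    0.00059880, 0.00039084, 0.00045975, 0.00043526, 0.00048453, 0.00037891,
    0.00044804, 0.00032283, 0.00044025, 0.00032449, 0.00050744, 0.00037881,
    0.00049471, 0.00045775, 0.00038158, 0.00046285, 0.00079286, 0.00045458,
]
stopper = PatienceStopper(patience=10)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

Replaying the logged losses through this rule stops at epoch 18 with the best loss taken from epoch 8, which agrees with the log.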


In [5]:
# validate model output
def load_audio(audio_path, sample_rate=44100, device='cpu'):
    # Load audio with librosa; mono=False keeps the channel axis of multi-channel files
    audio, sr = librosa.load(audio_path, mono=False, sr=sample_rate)

    # Handle mono audio by duplicating the channel
    if len(audio.shape) == 1:  # Mono audio
        audio = np.stack([audio, audio], axis=0)  # Convert to stereo

    audio = np.expand_dims(audio, axis=0)  # Shape: [1, channels, samples]
    audio = torch.from_numpy(audio).float().to(device)

    return audio

audio_tensor = load_audio("DnR/dnr_small/tr/106/mix.wav") # change to correct path
out = apply_model(trained_model, audio_tensor, shifts=1, overlap=0.8)[0].cpu().numpy()
print(out.shape)  # [sources, channels, samples]

for i, source in enumerate(out):
    source = source.mean(axis=0)  # Downmix the two channels to mono
    sf.write(f'small_train_masked_loss{i}.wav', source, 44100)

(3, 2, 2646000)


In [22]:
# reference: https://github.com/ZFTurbo/Music-Source-Separation-Training/blob/730d162b2ef31a1bba5d4a7d40ae914459a011b8/utils.py#L221
# TODO: move evaluation to separate module
def sdr(references, estimates):
    # compute SDR for one song
    delta = 1e-7  # avoid numerical errors
    num = np.sum(np.square(references), axis=(1, 2))
    den = np.sum(np.square(references - estimates), axis=(1, 2))
    num += delta
    den += delta
    return 10 * np.log10(num / den)

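def si_sdr(reference, estimate):
    # NOTE: with [sources, channels, samples] input, as used below, axis=(0, 1)
    # reduces over sources and channels, so `scale` holds one value per sample
    # index rather than one per source.
    eps = 1e-07
    scale = np.sum(estimate * reference + eps, axis=(0, 1)) / np.sum(reference**2 + eps, axis=(0, 1))
    scale = np.expand_dims(scale, axis=(0, 1))  # restore the reduced axes for broadcasting
    reference = reference * scale
    sisdr = np.mean(10 * np.log10(np.sum(reference**2, axis=(0, 1)) / (np.sum((reference - estimate)**2, axis=(0, 1)) + eps) + eps))
    return sisdr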

references_path_prefix = "DnR/dnr_small/tr/106"
predicted_file_prefix = "small_train_masked_loss"
references = []
predicted = []
for i, source in enumerate(model.sources):
    # load_audio returns a [1, channels, samples] tensor; drop the batch axis
    references.append(load_audio(f"{references_path_prefix}/{source}.wav")[0].numpy())
    predicted.append(load_audio(f"{predicted_file_prefix}{i}.wav")[0].numpy())

references = np.stack(references)  # Shape: [num_sources, channels, samples]
predicted = np.stack(predicted)  # Shape: [num_sources, channels, samples]

# Compute SDR
mean_sdr = np.mean(sdr(references, predicted))
print(f"Mean SDR: {mean_sdr}")

si_mean_sdr = np.mean(si_sdr(references, predicted))
print(f"Mean si_SDR: {si_mean_sdr}")

Mean SDR: -0.19516621530056
Mean si_SDR: -14.309621810913086
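
For comparison, scale-invariant SDR is commonly defined per source: the estimate is projected onto the reference to obtain a single optimal scale, and the ratio is taken between the scaled reference and the residual. The sketch below follows that definition for `[sources, channels, samples]` arrays. The function name `si_sdr_per_source` and the synthetic test signals are assumptions for illustration; this code was not used to produce the numbers printed above.

```python
import numpy as np

def si_sdr_per_source(references, estimates, eps=1e-7):
    """SI-SDR in dB for each source; inputs have shape [sources, channels, samples]."""
    refs = references.reshape(references.shape[0], -1).astype(np.float64)
    ests = estimates.reshape(estimates.shape[0], -1).astype(np.float64)
    # One scale per source: projection of the estimate onto the reference
    scale = np.sum(ests * refs, axis=1, keepdims=True) / (
        np.sum(refs**2, axis=1, keepdims=True) + eps
    )
    target = scale * refs
    noise = ests - target
    return 10 * np.log10(
        (np.sum(target**2, axis=1) + eps) / (np.sum(noise**2, axis=1) + eps)
    )

# Synthetic check: references plus noise at one tenth of the amplitude
rng = np.random.default_rng(0)
refs = rng.standard_normal((3, 2, 1000))
ests = refs + 0.1 * rng.standard_normal((3, 2, 1000))
scores = si_sdr_per_source(refs, ests)
scores_rescaled = si_sdr_per_source(refs, 0.5 * ests)   # same estimates, half the gain
```

Rescaling the estimate leaves the score essentially unchanged, which is the property that distinguishes SI-SDR from plain SDR.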
