<a href="https://colab.research.google.com/github/ttecles/aidl-lyrics-recognition/blob/main/SourceSeparation%2BSpeechRecognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Colab contains the final model, built by chaining Demucs (source separation) and Wav2Vec 2.0 (speech recognition): Demucs isolates the vocal track of a song and Wav2Vec transcribes it.

Install the required libraries (Demucs and Transformers[torch]):

In [None]:
!pip install -Uq demucs

  Building wheel for demucs (setup.py) ... done
  Building wheel for diffq (setup.py) ... done
  Building wheel for julius (setup.py) ... done


In [None]:
!pip install transformers[torch]

Collecting transformers[torch]
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)

Imports

In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torchaudio
from torch.utils.data import DataLoader, Dataset

import IPython.display as ipd
from IPython.display import Audio, display
import soundfile as sf

# Pretrained models: Demucs (source separation) and Wav2Vec2 (speech recognition)
from demucs.pretrained import load_pretrained
from transformers import AutoTokenizer, Wav2Vec2ForCTC, Wav2Vec2Processor

Download our two models:

In [None]:
Demucs = load_pretrained("demucs")

Wav2Vec = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
Wav2Vec_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

Downloading: "https://dl.fbaipublicfiles.com/demucs/v3.0/demucs-e07c671f.th" to /root/.cache/torch/hub/checkpoints/demucs-e07c671f.th


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


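Utility module that changes the sample rate by linear interpolation (kept for reference; the combined model below uses `torchaudio.transforms.Resample` instead)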

In [None]:
class ChangeSampleRate(nn.Module):
    """Resample a mono waveform by linear interpolation (no anti-aliasing filter)."""

    def __init__(self, input_rate: int, output_rate: int):
        super().__init__()
        self.output_rate = output_rate
        self.input_rate = input_rate

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # Only accepts 1-channel waveform input; flatten to (batch, samples)
        wav = wav.view(wav.size(0), -1)
        new_length = wav.size(-1) * self.output_rate // self.input_rate
        # Fractional position in the input for each output sample
        indices = torch.arange(new_length, device=wav.device) * (self.input_rate / self.output_rate)
        # Neighbouring input samples on either side of each position
        round_down = wav[:, indices.long()]
        round_up = wav[:, (indices.long() + 1).clamp(max=wav.size(-1) - 1)]
        # Blend the two neighbours by the fractional part of the position
        frac = indices.fmod(1.).unsqueeze(0)
        output = round_down * (1. - frac) + round_up * frac
        return output
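
The resampler above is a linear interpolation: output sample `i` reads the input at the fractional position `i * input_rate / output_rate` and blends the two neighbouring samples. As a minimal sketch, the same arithmetic can be written in plain Python on a list, which makes the indexing easier to follow. The function name `resample_linear` is hypothetical and used only for illustration.

```python
def resample_linear(samples, input_rate, output_rate):
    """Linearly interpolate a mono signal (list of floats) to a new sample rate."""
    new_length = len(samples) * output_rate // input_rate
    step = input_rate / output_rate  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(new_length):
        pos = i * step
        lo = int(pos)           # input sample at or before the position
        hi = min(lo + 1, last)  # next input sample, clamped at the end
        frac = pos - lo         # how far the position lies between lo and hi
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Halving the rate of a ramp keeps every second value.
ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(resample_linear(ramp, input_rate=4, output_rate=2))  # [0.0, 2.0, 4.0]
```

Linear interpolation applies no low-pass filter, so it can alias when downsampling; that may be one reason to prefer a band-limited resampler such as `torchaudio.transforms.Resample` for the 44.1 kHz to 16 kHz step.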

### **Demucs+Wav2Vec**

In [None]:
class DemucsWav2Vec(nn.Module):
    def __init__(self):
        super().__init__()
        self.demucs = load_pretrained("demucs")
        self.wav2vec = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
        self.wav2vec_processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
        # Demucs works at 44.1 kHz; Wav2Vec2 expects 16 kHz input
        self.changesamplerate = torchaudio.transforms.Resample(44100, 16000)

    def forward(self, input_tensor):
        # Demucs: (batch, channels, samples) -> (batch, sources, channels, samples)
        out_demucs = self.demucs(input_tensor)

        # Select the vocals stem (source index 3)
        output_voice = out_demucs[:, 3, :, :]

        # Stereo to mono: average the two channels
        output_voice_mono = torch.mean(output_voice, dim=1)

        # Resample from 44.1 kHz to 16 kHz
        output_voice_mono_sr = self.changesamplerate(output_voice_mono)

        # Reproduce the Wav2Vec2 processor normalization (zero mean, unit variance per clip);
        # keepdim=True keeps the statistics broadcastable for batch sizes above 1
        mean = output_voice_mono_sr.mean(1, keepdim=True)
        var = output_voice_mono_sr.var(1, keepdim=True)
        input_values = (output_voice_mono_sr - mean) / torch.sqrt(var + 1e-5)

        # Wav2Vec2: per-frame logits, then greedy CTC decoding of the first clip in the batch
        logits = self.wav2vec(input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = self.wav2vec_processor.decode(predicted_ids[0])

        return transcription
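
The last step of `forward` takes the most likely token per frame (`argmax`) and hands the ids to the processor's `decode`. For a CTC model this amounts to greedy decoding: merge consecutive repeats, drop the blank token, and turn the word separator into a space. The sketch below illustrates that rule in plain Python. The function `ctc_greedy_decode` and the toy vocabulary are hypothetical; the real vocabulary ships with the processor, and the sketch assumes `<pad>` acts as the blank and `|` as the word separator.

```python
def ctc_greedy_decode(ids, vocab, blank="<pad>", word_sep="|"):
    """Collapse repeated ids, drop blanks, and map the word separator to a space."""
    tokens = []
    previous = None
    for i in ids:
        if i != previous:  # merge consecutive repeats of the same id
            tokens.append(vocab[i])
        previous = i
    chars = [t for t in tokens if t != blank]  # blanks only separate repeats
    return "".join(chars).replace(word_sep, " ").strip()


# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = ["<pad>", "|", "H", "E", "L", "P", "M"]
frame_ids = [2, 2, 0, 3, 4, 4, 5, 1, 1, 6, 0, 3]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # HELP ME
```

A blank between two identical ids is what lets the model emit a doubled letter: `[4, 0, 4]` decodes to `LL`, while `[4, 4]` decodes to a single `L`.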

Load a sample MP3 from Google Drive and build the input tensor

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
audio, sr = torchaudio.load("/content/gdrive/My Drive/Help2.mp3")  # sr: sample rate read from the file
print(type(audio))
# Note: DemucsWav2Vec resamples from 44100 Hz, so the file should be at that rate
ipd.Audio(audio, rate=sr, autoplay=False)

<class 'torch.Tensor'>


In [None]:
input_tensor_model = audio.unsqueeze(0)  # add a batch dimension: (1, channels, samples)
print(input_tensor_model.shape)

torch.Size([1, 2, 460800])
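
The printed shape is enough to estimate what the rest of the pipeline will see. The arithmetic below is a rough sketch, not output from the notebook: it assumes the clip is at 44.1 kHz, approximates the resampled length by simple scaling (the exact count depends on the resampler), and assumes the standard wav2vec2-base feature encoder of seven convolutions with the kernel and stride values listed.

```python
num_samples = 460_800  # last dimension of the printed shape
source_rate = 44_100   # Hz, rate assumed by the model's resampler
target_rate = 16_000   # Hz, rate expected by Wav2Vec2

duration_s = num_samples / source_rate
approx_16k_samples = int(duration_s * target_rate)  # approximation, see lead-in

# Assumed (kernel, stride) of each conv layer in the wav2vec2-base feature encoder.
conv_layers = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

frames = approx_16k_samples
for kernel, stride in conv_layers:
    # Output length of an unpadded 1-D convolution
    frames = (frames - kernel) // stride + 1

print(f"{duration_s:.2f} s, ~{approx_16k_samples} samples at 16 kHz, ~{frames} frames")
```

Under these assumptions the clip lasts about 10.45 s and yields roughly one logit frame per 20 ms of audio, which is the sequence that the greedy decoding step collapses into text.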


Try the model

In [None]:
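Mymodel = DemucsWav2Vec()
Mymodel.eval()  # inference mode: disables dropout
with torch.no_grad():  # gradients are not needed for transcription
    transcription = Mymodel(input_tensor_model)
print(transcription)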

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


AND I WAS YOUNG  SO RUCH YOUNGER THAN JODA ON TNEMENE ANY ODIS HALE BEEN N ANYWAY
