# Open Source Audio Restoration Pipeline

This notebook implements the "Hacker Workflow" for the Restoring Robinson project.

**Steps:**
1. **Separation:** Use Facebook Demucs to strip room noise (isolating the vocal 'stem').
2. **Transcription:** Use OpenAI Whisper to generate text.
3. **Cloning:** Use Coqui XTTS to regenerate the voice (Optional/Advanced).

In [None]:
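# Install dependencies (Whisper also needs the ffmpeg command-line tool installed on the system)
!pip install -q demucs openai-whisper torch torchaudio pydub
# Step 3 (optional) additionally needs the Coqui TTS package:
# !pip install -q TTS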

## 1. Audio Cleaning (Stem Separation)
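Conventional noise reduction works by attenuating frequency bands, which tends to damage the voice along with the noise. Instead, we use Demucs, a music source-separation model, to split the audio into stems (Drums, Bass, Other, **Vocals**) and keep only the Vocals stem.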

In [None]:
import subprocess
from pathlib import Path

# CONFIGURATION
input_file = Path("../data/raw/robinson_neoclassical_part1.mp3")
output_dir = Path("../data/processed")

# Run Demucs (model: htdemucs, the Hybrid Transformer Demucs)
# -n htdemucs: the model to use
# --two-stems=vocals: only separate vocals vs. everything else (saves time)
# -o: root folder for the separated stems
# Passing the arguments as a list avoids shell-quoting problems with paths.
command = [
    "demucs", "-n", "htdemucs", "--two-stems=vocals",
    "-o", str(output_dir), str(input_file),
]

print(f"Processing {input_file}... this may take a while, especially without a GPU.")
# check=True raises CalledProcessError if Demucs exits with an error
subprocess.run(command, check=True)

print(f"Separation complete. Check the 'htdemucs' folder in {output_dir}.")
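
Before moving on, it can help to confirm that the vocals stem actually landed where the next step expects it. The sketch below assumes Demucs' default output layout, `<output_dir>/<model>/<track name>/<stem>.wav`, where the track name is the input file name without its extension. The helper name `expected_stem_path` is hypothetical and not part of Demucs.

```python
from pathlib import Path


def expected_stem_path(output_dir, input_file, model="htdemucs", stem="vocals"):
    """Return the path where Demucs is expected to write a stem.

    Assumes the default layout <output_dir>/<model>/<track name>/<stem>.wav,
    where <track name> is the input file name without its extension.
    """
    return Path(output_dir) / model / Path(input_file).stem / f"{stem}.wav"


vocals_path = expected_stem_path(
    "../data/processed/", "../data/raw/robinson_neoclassical_part1.mp3"
)
if vocals_path.exists():
    print(f"Found vocals stem: {vocals_path}")
else:
    print(f"Missing {vocals_path}; check the Demucs output above for errors.")
```

With `--two-stems=vocals`, the same folder should also contain `no_vocals.wav`, which is useful for hearing what was removed.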

## 2. Transcription with Whisper
Now we take the isolated vocal track and transcribe it.

In [None]:
import whisper

# Load the CLEANED audio. Demucs writes to <output_dir>/<model>/<track name>/vocals.wav
clean_audio_path = str(Path(output_dir) / "htdemucs" / "robinson_neoclassical_part1" / "vocals.wav")

# Load Whisper Model (sizes: tiny, base, small, medium, large)
# 'medium' is a good balance of speed/accuracy for English
model = whisper.load_model("medium")

print("Transcribing...")
result = model.transcribe(clean_audio_path)

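# Save transcript (create the output folder if it does not exist yet)
transcript_path = Path("../data/output/transcript.txt")
transcript_path.parent.mkdir(parents=True, exist_ok=True)
transcript_path.write_text(result["text"], encoding="utf-8")

print(f"Transcription saved to {transcript_path}")
print(result["text"][:500] + "...")  # Preview of the first 500 characters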
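
Besides `result["text"]`, the dictionary returned by `transcribe` carries a `segments` list whose entries include `start`, `end` (both in seconds), and `text`. A timestamped transcript makes it easier to locate passages in the recording. The following is a minimal sketch; the helper names `format_timestamp` and `format_segments` are hypothetical, and the sample data only stands in for a real `result["segments"]`.

```python
def format_timestamp(seconds):
    """Convert seconds to an HH:MM:SS string (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments):
    """Render Whisper-style segments as one '[start -> end] text' line each."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return "\n".join(lines)


# Stand-in data shaped like result["segments"]; swap in the real list after transcribing.
example_segments = [
    {"start": 0.0, "end": 4.2, "text": " First sentence."},
    {"start": 4.2, "end": 65.9, "text": " Second sentence."},
]
print(format_segments(example_segments))
```

In the notebook, `format_segments(result["segments"])` could be written to a second file next to the plain transcript.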

## 3. Voice Cloning (Coqui XTTS)
*Note: This requires significant VRAM. If running locally on a weak machine, skip this or use the Commercial Workflow.*

We use XTTS-v2, which can clone a voice from a short reference clip (about 6 seconds of clean speech).
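
Since only a few seconds of reference audio are needed, one option is to cut a short clip out of the separated vocals track and pass that as the speaker sample. The sketch below uses Python's standard `wave` module and assumes the Demucs output is a plain PCM WAV file (its default 16-bit output; float32 WAV files would need a different reader). The function name `extract_clip`, the start offset, and the output file name are all hypothetical.

```python
import wave


def extract_clip(src_path, dst_path, start_s=0.0, duration_s=6.0):
    """Copy duration_s seconds of a PCM WAV file, starting at start_s, to dst_path."""
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Clamp the start position so it never points past the end of the file
        src.setpos(min(int(start_s * rate), src.getnframes()))
        frames = src.readframes(int(duration_s * rate))
    with wave.open(str(dst_path), "wb") as dst:
        dst.setparams(params)  # same channels, sample width, and rate as the source
        dst.writeframes(frames)  # the header frame count is corrected on close
    return dst_path


# Hypothetical usage; pick a start offset where she speaks without interruption:
# extract_clip(clean_audio_path, "../data/processed/speaker_ref.wav", start_s=30.0)
```

Listening to the clip before cloning is worthwhile, because any residual noise or second speaker in the reference tends to carry over into the generated voice.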

In [None]:
from TTS.api import TTS
import torch

# Initialize TTS
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# The text you want her to say
target_text = "It is a great error to confuse the cost of money with the cost of capital."

# Generate audio
# speaker_wav should be a short clip (roughly 6-10 s) of the CLEANEST audio you have of her.
# The full vocals track is passed here; swap in a trimmed clip if you have one.
tts.tts_to_file(
    text=target_text,
    speaker_wav=clean_audio_path,
    language="en",
    file_path="../data/output/robinson_regenerated.wav"
)

print("Audio generated!")