## 🔹 Step 1: Extract Video & Audio from MP4  

In this step, we **separate the video and audio** from an MP4 file using **FFmpeg** inside a specific Conda environment.  

We perform two operations:  
- **Extract video only (no audio)**
- **Extract audio only (no video)**  

To ensure the correct FFmpeg version is used, we run it via `conda run` from the specified Conda environment.

✅ **Input:**  
- `video1.mp4` (original video with audio)

✅ **Output:**  
- `media_1_video.mp4` (only video, no audio)  
- `media_1_audio.mp3` (only audio, no video)  

Run the following code to perform the extraction:


In [None]:
import subprocess

# Define your Conda environment name
CONDA_ENV = "tataplay"  # Change this to your actual Conda environment name

def extract_video(input_file, output_video):
    # -c:v copy keeps the video stream as-is (no re-encode); -an drops all audio
    command = ["conda", "run", "-n", CONDA_ENV, "ffmpeg", "-i", input_file, "-c:v", "copy", "-an", output_video]
    subprocess.run(command, check=True)

def extract_audio(input_file, output_audio):
    # -map a selects only the audio streams; -q:a 0 requests the highest VBR quality for MP3
    command = ["conda", "run", "-n", CONDA_ENV, "ffmpeg", "-i", input_file, "-q:a", "0", "-map", "a", output_audio]
    subprocess.run(command, check=True)

# Example Usage
input_file = "video1.mp4"
output_video = "media_1_video.mp4"
output_audio = "media_1_audio.mp3"

extract_video(input_file, output_video)
extract_audio(input_file, output_audio)

print("Video and Audio extracted successfully!")
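
The two FFmpeg calls above differ only in their stream options, so the argument lists can be built by one small function and inspected before anything runs. The sketch below is one possible way to do that. `build_ffmpeg_command` is a hypothetical helper, not part of the original notebook, and the added `-y` flag (overwrite existing outputs without prompting) is an assumption about the behavior you want on reruns.

```python
def build_ffmpeg_command(conda_env, input_file, output_file, mode):
    """Return the argument list for extracting video or audio (hypothetical helper)."""
    stream_options = {
        "video": ["-c:v", "copy", "-an"],     # copy video stream, drop audio
        "audio": ["-q:a", "0", "-map", "a"],  # keep audio streams only
    }
    if mode not in stream_options:
        raise ValueError(f"mode must be 'video' or 'audio', got {mode!r}")
    return [
        "conda", "run", "-n", conda_env,
        "ffmpeg", "-y", "-i", input_file,  # -y is an assumption: overwrite outputs
        *stream_options[mode],
        output_file,
    ]

# Build (but do not run) both commands so they can be reviewed first
video_cmd = build_ffmpeg_command("tataplay", "video1.mp4", "media_1_video.mp4", "video")
audio_cmd = build_ffmpeg_command("tataplay", "video1.mp4", "media_1_audio.mp3", "audio")
print(" ".join(video_cmd))
print(" ".join(audio_cmd))
```

Either list can then be passed to `subprocess.run(command, check=True)` exactly as in the cell above.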


## 🔹 Step 2: Separate Vocals & Background with Demucs (Using Conda)
We use **Demucs** to separate vocals and background music, ensuring it runs inside the correct **Conda environment**.

### ✅ **Process:**
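- **Input:** `media_1_audio.mp3` (extracted from the video in Step 1)
- **Output** (written under `separated_audio/htdemucs/media_1_audio/`):
  - `vocals.wav` → 🎤 **Vocals only**
  - `no_vocals.wav` → 🎼 **Everything except vocals (background music)**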

### 🔹 **Command Used (Runs in `tataplay` Conda Env)**
```bash
conda run -n tataplay demucs --two-stems vocals --out separated_audio media_1_audio.mp3
```

---


In [2]:
import subprocess

def separate_audio_with_demucs_conda(input_audio, output_dir="separated_audio", conda_env="openvoice"):
    """
    Runs Demucs using subprocess via conda run to ensure it executes in the correct environment.

    Args:
        input_audio (str): Path to input audio file (MP3/WAV).
        output_dir (str): Directory to save separated audio.
        conda_env (str): Name of the Conda environment with Demucs installed.
    
    Returns:
        None
    """
    try:
        # Command to run Demucs inside the specified Conda environment
        command = [
            "conda", "run", "-n", conda_env,  # Run in the Conda environment
            "demucs", "--two-stems", "vocals",  # Extract vocals & background
            "--out", output_dir,  # Save to output directory
            input_audio  # Input audio file
        ]

        subprocess.run(command, check=True)
        print(f"✅ Separation complete! Check the '{output_dir}' folder.")

    except subprocess.CalledProcessError as e:
        print(f"❌ Error: {e}")

# Example Usage
separate_audio_with_demucs_conda("media_1_audio.mp3", conda_env="tataplay")


Important: the default model was recently changed to `htdemucs` the latest Hybrid Transformer Demucs model. In some cases, this model can actually perform worse than previous models. To get back the old default model use `-n mdx_extra_q`.
Downloading: "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/955717e8-8726e21a.th" to /home/skamalj/.cache/torch/hub/checkpoints/955717e8-8726e21a.th
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /mnt/d/dev/tataplay/separated_audio/htdemucs
Separating track media_1_audio.mp3

✅ Separation complete! Check the 'separated_audio' folder.
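
The log above shows that Demucs nests its results under the model name rather than writing directly into `separated_audio`. A minimal sketch for locating the two stems follows. It assumes the layout `<output_dir>/<model>/<track name>/`, inferred from the log line above, and that the model is `htdemucs`. `demucs_two_stem_paths` is a hypothetical helper name.

```python
from pathlib import Path

def demucs_two_stem_paths(input_audio, output_dir="separated_audio", model="htdemucs"):
    """Return where the two-stem outputs are expected to be (hypothetical helper)."""
    # Assumed layout: <output_dir>/<model>/<input file name without extension>/
    track_dir = Path(output_dir) / model / Path(input_audio).stem
    return {
        "vocals": track_dir / "vocals.wav",
        "no_vocals": track_dir / "no_vocals.wav",
    }

paths = demucs_two_stem_paths("media_1_audio.mp3")
for name, path in paths.items():
    status = "found" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
```

If a different model is selected with `-n`, pass its name as `model` so the paths match.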





In [None]:
from openai import OpenAI
import json

# Audio file path
#audio_path = "separated_audio/htdemucs/media_1_audio/vocals.wav"
audio_path = "media_1_audio.mp3"
client = OpenAI()
# Open the audio file for transcription
with open(audio_path, "rb") as audio_file:
    response = client.audio.transcriptions.create(
        model="whisper-1",  # OpenAI Whisper API model
        file=audio_file,
        response_format="verbose_json",  # Ensure JSON output
        timestamp_granularities=["word", "segment"]  # Enable detailed timestamps
    )

print(response)
# Extract useful data in JSON format
transcription_data = {
    "text": response.text,  # Full transcription
    "language": getattr(response, "language", "unknown"),  # Detected language
    "segments": []  # List of segments with timestamps
}

# Preserve segment-level timestamps (word-level timestamps are requested above but not saved here)
if hasattr(response, "segments"):
    for segment in response.segments:
        transcription_data["segments"].append({
            "start": segment.start,  # Segment start time in seconds
            "end": segment.end,      # Segment end time in seconds
            "text": segment.text,    # Transcribed text for this segment
            # Whisper segments do not report a "confidence" field, so this falls back to 1.0
            "confidence": getattr(segment, "confidence", 1.0),
        })

# Save as JSON
with open("transcription.json", "w", encoding="utf-8") as file:
    json.dump(transcription_data, file, indent=4, ensure_ascii=False)

print(f"Transcription saved to transcription.json (Detected Language: {transcription_data['language']})")
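
The request above asks for word-level timestamps, but only segments are written to `transcription.json`. The sketch below shows one way to convert the word entries into plain dictionaries that can be saved alongside the segments. It assumes the response exposes a `words` list whose items have `word`, `start`, and `end` attributes. That shape is an assumption about the SDK response, and the stand-in object and its values are purely illustrative.

```python
from types import SimpleNamespace

def words_to_dicts(response):
    """Convert word-level timestamps on a transcription response into plain dicts."""
    # Fall back to an empty list when the response carries no word timestamps
    words = getattr(response, "words", None) or []
    return [{"word": w.word, "start": w.start, "end": w.end} for w in words]

# Stand-in object shaped like the assumed response (illustrative values only)
fake_response = SimpleNamespace(words=[
    SimpleNamespace(word="hello", start=0.0, end=0.4),
    SimpleNamespace(word="world", start=0.5, end=0.9),
])
word_entries = words_to_dicts(fake_response)
print(word_entries)
```

With a real response, `transcription_data["words"] = words_to_dicts(response)` before the `json.dump` call would keep the word timings in the same file.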

In [None]:
from huggingface_hub import login
login()

In [None]:
import json
from pyannote.audio import Pipeline
import os

# Load the pretrained speaker diarization model
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=os.getenv("HUGGINGFACE_AUTH_TOKEN_TATA"))

# Define input audio file
audio_file = "media_1_audio.mp3"

# Run diarization once and store the result in a variable
diarization_result = list(pipeline(audio_file).itertracks(yield_label=True))
print(diarization_result)
# Load existing Whisper transcription
with open("transcription.json", "r", encoding="utf-8") as f:
    transcription_data = json.load(f)

# Assign speakers to transcription segments using stored diarization results
for segment in transcription_data["segments"]:
    segment_start, segment_end = segment["start"], segment["end"]

    for turn, _, speaker in diarization_result:
        # Two intervals overlap when each one starts before the other ends;
        # this also covers a segment that fully contains a short speaker turn
        if turn.start < segment_end and turn.end > segment_start:
            segment["speaker_id"] = speaker
            break  # Stop searching once an overlapping turn is found

# Save updated transcription with speaker IDs
with open("transcription.json", "w", encoding="utf-8") as f:
    json.dump(transcription_data, f, indent=4, ensure_ascii=False)

print("✅ Updated transcription.json with speaker IDs!")
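
The loop above labels each segment with the first matching speaker turn, which can pick the wrong speaker when a segment spans a speaker change. One alternative, sketched below under stated assumptions, is to choose the speaker with the largest total overlap. `assign_speakers` is a hypothetical helper, turns are passed as plain `(start, end, speaker)` tuples, and the sample data is illustrative rather than taken from a real run.

```python
def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turns overlap it the most.

    segments: list of dicts with "start" and "end" in seconds
    turns: list of (start, end, speaker) tuples
    """
    for segment in segments:
        overlap_by_speaker = {}
        for turn_start, turn_end, speaker in turns:
            # Length of the intersection of the two intervals (<= 0 means no overlap)
            overlap = min(segment["end"], turn_end) - max(segment["start"], turn_start)
            if overlap > 0:
                overlap_by_speaker[speaker] = overlap_by_speaker.get(speaker, 0.0) + overlap
        if overlap_by_speaker:
            segment["speaker_id"] = max(overlap_by_speaker, key=overlap_by_speaker.get)
    return segments

# Illustrative data, not taken from a real run
segments = [
    {"start": 0.0, "end": 4.0, "text": "first"},
    {"start": 5.0, "end": 9.0, "text": "second"},
]
turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 5.5, "SPEAKER_01"), (5.5, 9.0, "SPEAKER_00")]
labeled = assign_speakers(segments, turns)
print([s["speaker_id"] for s in labeled])  # ['SPEAKER_01', 'SPEAKER_00']
```

To use this with the diarization output from the cell above, the tuples can be built with `[(turn.start, turn.end, speaker) for turn, _, speaker in diarization_result]`.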
