# Interview Skill Improver API

This Google Colab notebook sets up a Flask API endpoint using `ngrok` to process video files. It performs the following tasks:
1.  **Video Frame Extraction & Analysis:** Extracts frames from the input video, performs person detection and emotion analysis using Moondream2.
2.  **Audio Extraction & Transcription:** Extracts audio from the video and transcribes it using OpenAI's Whisper model.
3.  **Audio Emotion Analysis:** Analyzes the emotion in the extracted audio using a pre-trained sentiment model.
4.  **CSM-1B Audio Generation:** Generates an audio comment using `sesame/csm-1b` based on the transcribed text.

The API exposes two endpoints:
* `/process` (POST): Takes a video file, processes it, and returns JSON containing frame analysis, transcript, and audio emotion.
* `/audio.wav` (GET): Serves the last generated `audio.wav` file (from `csm_generate`).

---

## 1. Setup and Dependencies

Run this cell to install all necessary Python libraries. This might take a few minutes.

```python
!pip install -qqq flask pyngrok moviepy opencv-python transformers accelerate bitsandbytes torchaudio librosa
```

In [3]:
!pip install -qqq flask pyngrok moviepy opencv-python transformers accelerate bitsandbytes torchaudio librosa

In [7]:
#@title Connect to ngrok 🌐

#@markdown To expose your local server to the internet, you need an `ngrok` authtoken.
#@markdown 1. Go to [https://ngrok.com/signup](https://ngrok.com/signup) and sign up for a free account.
#@markdown 2. Visit your [ngrok Dashboard](https://dashboard.ngrok.com/get-started/your-authtoken) to find your authtoken.
#@markdown 3. Paste your authtoken into the field below and run this cell.

NGROK_AUTH_TOKEN = "2zHmoBMCGzBDa28fosxaf8K2AzK_4Tb4qqU4BQEZui4iSk6yJ" #@param {type:"string"}

from pyngrok import ngrok
import os


if NGROK_AUTH_TOKEN:
    try:
        ngrok.set_auth_token(NGROK_AUTH_TOKEN)
        print("✅ ngrok authtoken set successfully.")
    except Exception as e:
        print(f"❌ Error setting ngrok authtoken: {e}")
        print("Please ensure your token is correct and try again.")
else:
    print("⚠️ NGROK_AUTH_TOKEN is empty. Please paste your token above.")

✅ ngrok authtoken set successfully.


In [8]:
!pip install -qqq flask pyngrok moviepy opencv-python librosa transformers torch torchaudio pillow

In [15]:
from pyngrok import ngrok
import os; #token = os.environ["NGROK_API_KEY"]; ngrok.set_auth_token(token)
from flask import Flask, request, jsonify, send_file
from pyngrok import ngrok
from moviepy.editor import VideoFileClip
import cv2, torch, numpy as np, json, base64
from PIL import Image
from transformers import (AutoModelForCausalLM, AutoProcessor as CSMProcessor,
                          CsmForConditionalGeneration, WhisperProcessor,
                          WhisperForConditionalGeneration, AutoFeatureExtractor,
                          AutoModelForAudioClassification)
public_url = ngrok.connect(5000); print(public_url)
device = "cuda" if torch.cuda.is_available() else "cpu"
moondream = AutoModelForCausalLM.from_pretrained("vikhyatk/moondream2", revision="2025-06-21", trust_remote_code=True, device_map={"":device})
whisper_proc = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device); whisper_model.config.forced_decoder_ids=None
sent_model_id="firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3"
sent_model=AutoModelForAudioClassification.from_pretrained(sent_model_id).to(device)
sent_feat=AutoFeatureExtractor.from_pretrained(sent_model_id, do_normalize=True)
id2label=sent_model.config.id2label
csm_id="sesame/csm-1b"
csm_proc=CSMProcessor.from_pretrained(csm_id)
csm_model=CsmForConditionalGeneration.from_pretrained(csm_id, device_map=device)
app=Flask(__name__)
def process_video(i,o):
    cap=cv2.VideoCapture(i); fps=cap.get(cv2.CAP_PROP_FPS)
    w,h=int(min(cap.get(3),1280)),int(min(cap.get(4),720))
    os.makedirs(o,exist_ok=True)
    interval=max(int(fps//20),1); maxf=int(fps*120); cnt=0; sv=0; res={}
    while cnt<maxf and cap.read()[0]:
        ret,frm=cap.read()
        if ret and cnt%interval==0:
            img=cv2.resize(frm,(w,h)); fn=f"{sv:05d}.jpg"; cv2.imwrite(f"{o}/{fn}",img)
            pil=Image.fromarray(cv2.cvtColor(img,cv2.COLOR_BGR2RGB))
            res[fn]={"people":moondream.query(pil,"How many people are there?")["answer"],
                     "emotion":moondream.query(pil,"What is the emotional and physical state of the person? Consider stress, anxiety, fear, calmness, confidence, relaxation, attentiveness, casualness, or discomfort.")["answer"]}
            sv+=1
        cnt+=1
    cap.release(); return res
def extract_audio(v,a): VideoFileClip(v).audio.write_audiofile(a)
def transcribe(a):
    import torchaudio
    wav,sr=torchaudio.load(a)
    if sr!=16000: wav=torch.mean(torchaudio.transforms.Resample(sr,16000)(wav),0,True)
    inp=whisper_proc(wav.squeeze(),sampling_rate=16000,return_tensors="pt")["input_features"].to(device)
    return whisper_proc.batch_decode(whisper_model.generate(inp),skip_special_tokens=True)[0]
def sentiment(a):
    import librosa
    d,s=librosa.load(a,sr=sent_feat.sampling_rate)
    ml=int(sent_feat.sampling_rate*30)
    d=d[:ml] if len(d)>ml else np.pad(d,(0,ml-len(d)))
    inp=sent_feat(d,sampling_rate=sent_feat.sampling_rate,return_tensors="pt").to(device)
    return id2label[torch.argmax(sent_model(**inp).logits, -1).item()]
def generate_comment(t):
    inp=csm_proc(f"[0]{t}",add_special_tokens=True).to(device)
    audio=csm_model.generate(**inp,output_audio=True)
    path="audio.wav"; csm_proc.save_audio(audio,path); return path
@app.route('/process',methods=['POST'])
def p():
    request.files['file'].save('in.mp4')
    f=process_video('in.mp4','frames'); extract_audio('in.mp4','audio.wav')
    txt=transcribe('audio.wav'); emo=sentiment('audio.wav'); c=generate_comment(txt)
    return jsonify({"frames":f,"transcript":txt,"audio_emotion":emo}), 200
@app.route('/audio.wav')
def a(): return send_file('audio.wav')
app.run(host='0.0.0.0',port=5000)


NgrokTunnel: "https://ee05c0d47a1d.ngrok-free.app" -> "http://localhost:5000"


preprocessor_config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/449 [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/2.00k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/3.28k [00:00<?, ?B/s]

transformers.safetensors.index.json:   0%|          | 0.00/59.7k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

transformers-00002-of-00002.safetensors:   0%|          | 0.00/2.19G [00:00<?, ?B/s]

transformers-00001-of-00002.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/264 [00:00<?, ?B/s]

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://172.28.0.12:5000
INFO:werkzeug:[33mPress CTRL+C to quit[0m
INFO:werkzeug:127.0.0.1 - - [24/Jul/2025 16:45:47] "[33mGET / HTTP/1.1[0m" 404 -
INFO:werkzeug:127.0.0.1 - - [24/Jul/2025 16:45:47] "[33mGET /favicon.ico HTTP/1.1[0m" 404 -
INFO:werkzeug:127.0.0.1 - - [24/Jul/2025 16:45:51] "[33mGET / HTTP/1.1[0m" 404 -
INFO:werkzeug:127.0.0.1 - - [24/Jul/2025 16:45:52] "[33mGET /favicon.ico HTTP/1.1[0m" 404 -
INFO:werkzeug:127.0.0.1 - - [24/Jul/2025 16:46:54] "[33mGET / HTTP/1.1[0m" 404 -
INFO:werkzeug:127.0.0.1 - - [24/Jul/2025 16:46:54] "[33mGET /favicon.ico HTTP/1.1[0m" 404 -
INFO:werkzeug:127.0.0.1 - - [24/Jul/2025 16:46:58] "[33mGET / HTTP/1.1[0m" 404 -
INFO:werkzeug:127.0.0.1 - - [24/Jul/2025 16:46:58] "[33mGET /favicon.ico HTTP/1.1[0m" 404 -


In [1]:
#@title Connect to Huggingface
#@markdown
#@markdown 1. Create a Hugging Face account if you don't have one: https://huggingface.co/join
#@markdown 2. Go to your settings to find your User Access Tokens: https://huggingface.co/settings/tokens
#@markdown 3. Create a read token (or a write token if you plan to push models).
#@markdown 4. Paste your token into the field below and run this cell.

from huggingface_hub import login
import os

HUGGINGFACE_TOKEN = "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" #@param {type:"string"}

if HUGGINGFACE_TOKEN:
    try:
        login(token=HUGGINGFACE_TOKEN)
        print("✅ Successfully logged in to Hugging Face Hub.")
    except Exception as e:
        print(f"❌ Error logging in to Hugging Face Hub: {e}")
        print("Please ensure your token is correct and try again.")
else:
    print("⚠️ Hugging Face token is empty. Some models might fail to load if they require authentication.")

✅ Successfully logged in to Hugging Face Hub.


In [1]:
# STEP 1: Install dependencies
!pip install transformers torchaudio librosa moviepy accelerate --quiet

# STEP 2: Upload your video
# from google.colab import files
# uploaded = files.upload()

# import shutil
# video_path = next(iter(uploaded))
# shutil.move(video_path, 'input.mp4')


In [2]:
# STEP 3: Import & Load All Models
import torch, os, json, cv2, numpy as np
from moviepy.editor import VideoFileClip
from PIL import Image
from transformers import (
    AutoModelForCausalLM, AutoProcessor as CSMProcessor,
    CsmForConditionalGeneration,
    WhisperProcessor, WhisperForConditionalGeneration,
    AutoFeatureExtractor, AutoModelForAudioClassification
)

device = "cuda" if torch.cuda.is_available() else "cpu"

moondream = AutoModelForCausalLM.from_pretrained("vikhyatk/moondream2", revision="2025-06-21", trust_remote_code=True, device_map={"":device})

whisper_proc = WhisperProcessor.from_pretrained("openai/whisper-small")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
whisper_model.config.forced_decoder_ids = None

sent_model_id = "firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3"
sent_model = AutoModelForAudioClassification.from_pretrained(sent_model_id).to(device)
sent_feat = AutoFeatureExtractor.from_pretrained(sent_model_id, do_normalize=True)
id2label = sent_model.config.id2label

csm_id = "sesame/csm-1b"
csm_proc = CSMProcessor.from_pretrained(csm_id)
csm_model = CsmForConditionalGeneration.from_pretrained(csm_id, device_map=device)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [4]:
#@title all frames
# STEP 4: Define processing functions
from torchvision import transforms

def process_video(i, o):
    cap = cv2.VideoCapture(i)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w, h = int(min(cap.get(3),1280)), int(min(cap.get(4),720))
    os.makedirs(o, exist_ok=True)
    interval = max(int(fps // 20), 1)
    maxf = int(fps * 120)
    cnt = 0
    sv = 0
    res = {}

    while cnt < maxf and cap.read()[0]:
        ret, frm = cap.read()
        if ret and cnt % interval == 0:
            img = cv2.resize(frm, (w, h))
            fn = f"{sv:05d}.jpg"
            cv2.imwrite(f"{o}/{fn}", img)
            pil = Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
            res[fn] = {
                "people": moondream.query(pil, "How many people are there?")["answer"],
                "emotion": moondream.query(pil, "What is the emotional and physical state of the person? Consider stress, anxiety, fear, calmness, confidence, relaxation, attentiveness, casualness, or discomfort.")["answer"]
            }
            sv += 1
        cnt += 1

    cap.release()
    return res

def extract_audio(video_path, audio_path):
    VideoFileClip(video_path).audio.write_audiofile(audio_path, verbose=False, logger=None)

def transcribe(audio_path):
    import torchaudio
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torch.mean(torchaudio.transforms.Resample(sr, 16000)(wav), 0, True)
    inp = whisper_proc(wav.squeeze(), sampling_rate=16000, return_tensors="pt")["input_features"].to(device)
    out = whisper_model.generate(inp)
    return whisper_proc.batch_decode(out, skip_special_tokens=True)[0]

def sentiment(audio_path):
    import librosa
    d, sr = librosa.load(audio_path, sr=sent_feat.sampling_rate)
    ml = int(sr * 30)
    d = d[:ml] if len(d) > ml else np.pad(d, (0, ml - len(d)))
    inp = sent_feat(d, sampling_rate=sr, return_tensors="pt").to(device)
    out = sent_model(**inp)
    return id2label[torch.argmax(out.logits, -1).item()]

def generate_comment(text):
    inp = csm_proc(f"[0]{text}", add_special_tokens=True).to(device)
    audio = csm_model.generate(**inp, output_audio=True)
    out_path = "audio_comment.wav"
    csm_proc.save_audio(audio, out_path)
    return out_path


In [7]:
# STEP 5: Run full pipeline
video_path = "/content/vd0.mp4"
frames_dir = "frames"
audio_path = "audio.wav"

# Process video -> frames + Moondream analysis
frame_results = process_video(video_path, frames_dir)
with open("frames.json", "w") as f:
    json.dump(frame_results, f, indent=2)

# Extract audio
extract_audio(video_path, audio_path)

# Whisper transcription
transcript = transcribe(audio_path)
with open("transcript.txt", "w") as f:
    f.write(transcript)

# Audio sentiment
emo = sentiment(audio_path)
with open("emotion.txt", "w") as f:
    f.write(emo)

# Generate feedback audio from text
csm_audio = generate_comment(transcript)


Video has 1242 frames. Sampling 20 frames.


OutOfMemoryError: CUDA out of memory. Tried to allocate 82.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 48.12 MiB is free. Process 107009 has 14.69 GiB memory in use. Of the allocated memory 14.44 GiB is allocated by PyTorch, and 131.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
# @title Default title text
# import torch
# import gc

# if 'moondream' in locals() and moondream is not None:
#     moondream.to('cpu')
#     del moondream
#     print("Moondream model deleted.")

# if 'whisper_model' in locals() and whisper_model is not None:
#     whisper_model.to('cpu')
#     del whisper_model
#     print("Whisper model deleted.")

# if 'sent_model' in locals() and sent_model is not None:
#     sent_model.to('cpu')
#     del sent_model
#     print("Sentiment model deleted.")

# if 'csm_model' in locals() and csm_model is not None:
#     csm_model.to('cpu')
#     del csm_model
#     print("CSM-1B model deleted.")

# gc.collect()
# print("Python garbage collector run.")

# if torch.cuda.is_available():
#     torch.cuda.empty_cache()
#     print("CUDA cache emptied.")

# if torch.cuda.is_available():
#     allocated = torch.cuda.memory_allocated()
#     cached = torch.cuda.memory_reserved()
#     print(f"\nCUDA Memory after cleanup:")
#     print(f"  Allocated: {allocated / (1024**3):.2f} GB")
#     print(f"  Cached: {cached / (1024**3):.2f} GB")
# else:
#     print("\nCUDA not available or already cleared.")

Moondream model deleted.
Whisper model deleted.
Sentiment model deleted.


In [3]:
#@title Random 20 frames
# STEP 4: Define processing functions
import random # Import the random module for sampling
from torchvision import transforms # Keep this if you use it elsewhere, though not directly used in the modified process_video

def process_video(i, o):
    cap = cv2.VideoCapture(i)
    if not cap.isOpened():
        print(f"Error: Could not open video file {i}")
        return {}

    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    w, h = int(min(cap.get(3),1280)), int(min(cap.get(4),720))
    os.makedirs(o, exist_ok=True)
    res = {}

    # Ensure we don't try to sample more frames than available
    num_frames_to_sample = min(20, total_frames)

    # Randomly select 20 unique frame indices
    # We use range(total_frames) to get all possible frame indices
    selected_frame_indices = sorted(random.sample(range(total_frames), num_frames_to_sample))

    print(f"Video has {total_frames} frames. Sampling {num_frames_to_sample} frames.")

    for i, frame_idx in enumerate(selected_frame_indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx) # Set the current frame position
        ret, frm = cap.read()

        if ret:
            img = cv2.resize(frm, (w, h))
            fn = f"frame_{frame_idx:05d}.jpg" # Use actual frame index in filename
            cv2.imwrite(f"{o}/{fn}", img)
            pil = Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

            # Moondream queries
            people_query_result = moondream.query(pil, "How many people are there?")
            emotion_query_result = moondream.query(pil, "What is the emotional and physical state of the person? Consider stress, anxiety, fear, calmness, confidence, relaxation, attentiveness, casualness, or discomfort.")

            res[fn] = {
                "people": people_query_result['answer'] if people_query_result and 'answer' in people_query_result else "N/A",
                "emotion": emotion_query_result['answer'] if emotion_query_result and 'answer' in emotion_query_result else "N/A"
            }
            print(f"Processed frame {i+1}/{num_frames_to_sample} (original index: {frame_idx})")
        else:
            print(f"Warning: Could not read frame at index {frame_idx}")

    cap.release()
    print(f"Completed processing {len(res)} randomly sampled frames.")
    return res

def extract_audio(video_path, audio_path):
    print(f"Extracting audio from {video_path} to {audio_path}...")
    try:
        VideoFileClip(video_path).audio.write_audiofile(audio_path, verbose=False, logger=None)
        print("Audio extraction complete.")
    except Exception as e:
        print(f"Error during audio extraction: {e}")
        raise # Re-raise the exception to be caught by the API handler

def transcribe(audio_path):
    import torchaudio
    print(f"Transcribing audio from {audio_path}...")
    try:
        wav, sr = torchaudio.load(audio_path)
        if sr != 16000:
            # Ensure audio is mono before resampling if it's multichannel
            if wav.shape[0] > 1:
                wav = torch.mean(wav, dim=0, keepdim=True)
            wav = torchaudio.transforms.Resample(sr, 16000)(wav)

        # Squeeze to ensure it's 1D for whisper_proc if it somehow remains 2D (1, N)
        if wav.dim() > 1:
            wav = wav.squeeze(0)

        inp = whisper_proc(wav, sampling_rate=16000, return_tensors="pt")["input_features"].to(device)
        out = whisper_model.generate(inp)
        transcript = whisper_proc.batch_decode(out, skip_special_tokens=True)[0]
        print("Transcription complete.")
        return transcript
    except Exception as e:
        print(f"Error during transcription: {e}")
        return "Transcription failed."

def sentiment(audio_path):
    import librosa
    print(f"Performing sentiment analysis on {audio_path}...")
    try:
        d, sr = librosa.load(audio_path, sr=sent_feat.sampling_rate)
        ml = int(sr * 30) # Max 30 seconds for analysis
        d = d[:ml] if len(d) > ml else np.pad(d, (0, ml - len(d)))
        inp = sent_feat(d, sampling_rate=sr, return_tensors="pt").to(device)
        out = sent_model(**inp)
        emotion_label = id2label[torch.argmax(out.logits, -1).item()]
        print(f"Sentiment analysis complete: {emotion_label}")
        return emotion_label
    except Exception as e:
        print(f"Error during sentiment analysis: {e}")
        return "Sentiment analysis failed."

def generate_comment(text):
    print(f"Generating CSM audio for text: '{text[:50]}...'") # Print first 50 chars
    try:
        inp = csm_proc(f"[0]{text}", add_special_tokens=True, return_tensors="pt").to(device)
        audio = csm_model.generate(**inp, output_audio=True)
        out_path = "audio_comment.wav"
        csm_proc.save_audio(audio, out_path)
        print(f"CSM audio generated and saved to {out_path}")
        return out_path
    except Exception as e:
        print(f"Error generating CSM audio: {e}")
        return None

In [4]:
# STEP 5: Run full pipeline
video_path = "/content/vd0.mp4"
frames_dir = "frames"
audio_path = "audio.wav"

# Process video -> frames + Moondream analysis
frame_results = process_video(video_path, frames_dir)
with open("frames.json", "w") as f:
    json.dump(frame_results, f, indent=2)

# Extract audio
extract_audio(video_path, audio_path)

# Whisper transcription
transcript = transcribe(audio_path)
with open("transcript.txt", "w") as f:
    f.write(transcript)

del whisper_proc
del whisper_model

# Audio sentiment
emo = sentiment(audio_path)
with open("emotion.txt", "w") as f:
    f.write(emo)

# Generate feedback audio from text
csm_audio = generate_comment(transcript)


Video has 1242 frames. Sampling 20 frames.
Processed frame 1/20 (original index: 7)
Processed frame 2/20 (original index: 147)
Processed frame 3/20 (original index: 172)
Processed frame 4/20 (original index: 238)
Processed frame 5/20 (original index: 245)
Processed frame 6/20 (original index: 250)
Processed frame 7/20 (original index: 280)
Processed frame 8/20 (original index: 308)
Processed frame 9/20 (original index: 323)
Processed frame 10/20 (original index: 364)
Processed frame 11/20 (original index: 423)
Processed frame 12/20 (original index: 447)
Processed frame 13/20 (original index: 528)
Processed frame 14/20 (original index: 604)
Processed frame 15/20 (original index: 706)
Processed frame 16/20 (original index: 773)
Processed frame 17/20 (original index: 879)
Processed frame 18/20 (original index: 970)
Processed frame 19/20 (original index: 1062)
Processed frame 20/20 (original index: 1226)
Completed processing 20 randomly sampled frames.
Extracting audio from /content/vd0.mp

Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Transcription complete.
Performing sentiment analysis on audio.wav...
Error during sentiment analysis: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 16.12 MiB is free. Process 211317 has 14.72 GiB memory in use. Of the allocated memory 14.34 GiB is allocated by PyTorch, and 259.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Generating CSM audio for text: ' So favorite movies right? So my favorite movies a...'
CSM audio generated and saved to audio_comment.wav
