
Feature/transcipt worker using whisper #7

Merged
vector17002 merged 2 commits into main from feature/transcipt-worker-using-whisper
Apr 25, 2026

Conversation

@vector17002 (Owner) commented Apr 24, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic transcription feature to extract and process audio from uploaded videos.
  • Bug Fixes

    • Enhanced error state tracking across video processing pipelines to provide more consistent failure status updates.
  • Chores

    • Database schema migration to support transcript storage and improved data integrity constraints.
    • Infrastructure updates to container configuration for transcription workload management.

@coderabbitai (Bot) commented Apr 24, 2026

📝 Walkthrough


This pull request introduces an AI-powered transcription worker to the video processing pipeline. It adds a new ai-worker service with Whisper model support, creates a transcription queue and worker, updates the database schema to track transcripts, integrates transcription jobs into the transcode workflow, and adjusts worker status tracking across the pipeline.

Changes

  • Infrastructure & Services (docker-compose.yml, server/Dockerfile, server/src/ai-worker.ts)
    Removes DATABASE_URL from the api and worker environments; introduces a new ai-worker service with a multi-stage Dockerfile build target using the Whisper base model (preloaded at build time) and a Python virtualenv; adds an entrypoint for AI worker bootstrap.
  • Database Schema (server/drizzle/0001_glorious_shooting_star.sql, server/drizzle/meta/0001_snapshot.json, server/drizzle/meta/_journal.json, server/src/models/video.model.ts)
    Migration adds a transcript_key text column and a foreign key constraint from videoTable.user_id to userTable.id; updates Drizzle schema definitions and the migration journal.
  • Transcription Pipeline (server/src/services/transcribe.service.ts, server/src/workers/transcribe.queue.ts, server/src/workers/transcribe.worker.ts)
    New transcription service orchestrates the Whisper workflow: downloads the original video via a presigned S3 URL, extracts mono 16kHz WAV audio, invokes a faster-whisper subprocess, and uploads the transcript JSON to S3; new transcribeQueue with 3 retries and exponential backoff; new transcribeWorker processes jobs, updates DB status, and handles success/failure events.
  • Worker Status Tracking (server/src/workers/transcode.worker.ts, server/src/workers/hls.worker.ts, server/src/workers/thumbnail.worker.ts)
    Transcode worker enqueues a transcription job after HLS; all three workers now update the overall videoTable.status to 'failed' on failure alongside status-specific fields.
  • S3 Path Update (server/src/services/transcode.service.ts)
    Adjusts presigned URL object path construction from original/${fileId} to ${fileId}/original.
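The path change above is small but easy to get backwards. A minimal sketch of key builders reflecting the new per-video prefix layout (the helper names and the transcript.json suffix are illustrative, not taken from the PR):

```typescript
// Builds the S3 object key for the original upload.
// New layout groups every artifact for a video under its fileId prefix:
//   <fileId>/original   (previously original/<fileId>)
function originalObjectKey(fileId: string): string {
    return `${fileId}/original`;
}

// Derived artifacts can share the same prefix; the exact transcript
// filename is an assumption for illustration.
function transcriptObjectKey(fileId: string): string {
    return `${fileId}/transcript.json`;
}
```

Keeping all derived objects under one prefix makes per-video cleanup a single prefix listing instead of scattered deletes.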

Sequence Diagram

sequenceDiagram
    actor Transcode as Transcode Worker
    participant Queue as Transcribe Queue
    participant Worker as Transcribe Worker
    participant S3 as S3 Storage
    participant FFmpeg as FFmpeg
    participant Python as Whisper Python
    participant Database as PostgreSQL
    
    Transcode->>Queue: Enqueue transcribe job<br/>{fileId, userId}
    Queue->>Worker: Trigger job
    Worker->>Database: Update status = 'processing'
    Worker->>S3: Get presigned URL for<br/>original video
    S3-->>Worker: Presigned URL
    Worker->>Worker: Download video file
    Worker->>FFmpeg: Extract mono 16kHz WAV
    FFmpeg-->>Worker: Audio file path
    Worker->>Python: Run faster-whisper<br/>subprocess with script
    Python-->>Worker: JSON transcript result
    Worker->>S3: Upload transcript JSON
    S3-->>Worker: S3 transcript key
    Worker->>Database: Update transcript_key &<br/>status = 'completed'
    Worker->>Worker: Cleanup temp files
    Worker-->>Queue: Job completed

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #1: Modifies server/src/services/transcode.service.ts and server/src/workers/transcode.worker.ts, which are also updated in this PR to enqueue transcription jobs downstream.
  • PR #4: Updates docker-compose.yml and server/Dockerfile to introduce/adjust worker services and build targets, overlapping with infrastructure changes here.
  • PR #2: Modifies server/src/workers/transcode.worker.ts to handle job enqueueing and failure logic, directly related to transcode worker updates in this PR.

Poem

🐰 A whisper across the meadow, transcripts bloom,
From video files to clarity's room,
The AI worker hops through queues with care,
Extracting audio through the digital air,
Where Whisper's wisdom turns sound into text fair! 🎙️✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title 'Feature/transcipt worker using whisper' accurately describes the primary change: adding a new AI transcription worker that uses the Whisper model for transcribing videos.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate; docstring coverage check skipped.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.


@coderabbitai (Bot) left a comment


Actionable comments posted: 15

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docker-compose.yml`:
- Around line 51-72: The WHISPER_MODEL value is baked at image build (ARG/ENV)
but can be overridden at runtime via process.env, causing mismatches with
pre-downloaded models; to fix, prevent runtime overrides by removing
WHISPER_MODEL from runtime env sources (remove it from server/.env and from the
docker-compose environment block) so the container uses the build-time ENV only,
and add a short inline comment next to the ai-worker block documenting that
changing WHISPER_MODEL requires rebuilding the image; reference WHISPER_MODEL,
the ai-worker service in docker-compose, and
runWhisper/process.env.WHISPER_MODEL in
server/src/services/transcribe.service.ts to locate relevant code.
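A sketch of what that compose change could look like. The service name and WHISPER_MODEL follow the review; the surrounding build configuration is assumed:

```yaml
services:
  ai-worker:
    build:
      context: ./server
      target: ai-worker
      # WHISPER_MODEL is baked into the image at build time so the model
      # weights are pre-downloaded. Changing it requires a rebuild, e.g.:
      #   docker compose build --build-arg WHISPER_MODEL=small ai-worker
      args:
        WHISPER_MODEL: base
    # Deliberately no WHISPER_MODEL under `environment:` — a runtime
    # override would not match the weights downloaded during the build.
```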

In `@server/Dockerfile`:
- Around line 60-68: The Dockerfile ai-worker stage currently installs system
packages but leaves the container running as root and has no HEALTHCHECK; add a
dedicated non-root user and a simple healthcheck: create a user/group (e.g.,
aiuser), add a WORKDIR, ensure required directories are owned by that user
(chown), switch to USER aiuser before finalizing the image, and add a
HEALTHCHECK that runs a lightweight command verifying binaries are present
(e.g., ffmpeg --version and python3 -m venv or python3 -c checks) so the
orchestrator can detect a wedged container; update the ai-worker stage around
the existing RUN that installs ffmpeg/python3 to create the user, set ownership,
and append a HEALTHCHECK instruction.
- Around line 75-85: Pin the faster-whisper dependency and set an explicit
HF_HOME cache path so builds are reproducible and the pre-download step caches
weights in a known location; change the pip install to install a fixed version
(e.g., faster-whisper==1.2.1) and add an ENV or ARG HF_HOME (e.g.,
HF_HOME=/app/.cache/huggingface) before the pre-download RUN, and ensure the
pre-download Python invocation that constructs WhisperModel('${WHISPER_MODEL}',
device='cpu', compute_type='int8') runs with HF_HOME exported so model weights
are stored under $HF_HOME (use the existing VIRTUAL_ENV and WHISPER_MODEL
symbols to locate the relevant blocks).
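The two Dockerfile suggestions above (pin the dependency, fix the cache path, drop root, add a healthcheck) could be combined roughly as follows. The base image, user name, and directory layout are illustrative; the faster-whisper==1.2.1 pin is the example version from the review:

```dockerfile
# ai-worker stage — sketch of the suggested hardening, not the PR's actual stage.
FROM node:20-slim AS ai-worker
RUN apt-get update \
    && apt-get install -y --no-install-recommends ffmpeg python3 python3-venv \
    && rm -rf /var/lib/apt/lists/*

ARG WHISPER_MODEL=base
ENV VIRTUAL_ENV=/opt/venv \
    HF_HOME=/app/.cache/huggingface

# Pinned install + pre-download of model weights into the known HF_HOME cache.
RUN python3 -m venv "$VIRTUAL_ENV" \
    && "$VIRTUAL_ENV/bin/pip" install --no-cache-dir faster-whisper==1.2.1 \
    && "$VIRTUAL_ENV/bin/python" -c "from faster_whisper import WhisperModel; WhisperModel('${WHISPER_MODEL}', device='cpu', compute_type='int8')"

# Run as a dedicated non-root user that owns the working directories.
RUN groupadd -r aiuser && useradd -r -g aiuser aiuser \
    && mkdir -p /app/downloads && chown -R aiuser:aiuser /app
WORKDIR /app
USER aiuser

# Lightweight liveness probe: required binaries and the whisper package exist.
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD ffmpeg -version >/dev/null 2>&1 && "$VIRTUAL_ENV/bin/python" -c "import faster_whisper" || exit 1
```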

In `@server/drizzle/0001_glorious_shooting_star.sql`:
- Around line 1-2: The migration currently adds "transcript_key" to videoTable
and immediately adds the foreign key constraint
videoTable_user_id_userTable_id_fk, which validates all existing rows while
holding an ACCESS EXCLUSIVE lock; change the FK addition to a two-step approach
by adding the constraint as NOT VALID first and then running VALIDATE CONSTRAINT
videoTable_user_id_userTable_id_fk to avoid long locks on large tables, and
decide whether the delete semantics should change from ON DELETE NO ACTION to
ON DELETE CASCADE (or document/implement a soft-delete alternative) if deleting
a user must remove related videos; update the migration to implement the NOT
VALID/VALIDATE pattern and adjust the ON DELETE clause accordingly.
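A sketch of the NOT VALID / VALIDATE pattern applied to this migration. The delete semantics are left as in the original (NO ACTION); switching to CASCADE is the author's call:

```sql
ALTER TABLE "videoTable" ADD COLUMN "transcript_key" text;

-- Step 1: add the FK without validating existing rows, so the
-- ACCESS EXCLUSIVE lock is held only briefly.
ALTER TABLE "videoTable"
    ADD CONSTRAINT "videoTable_user_id_userTable_id_fk"
    FOREIGN KEY ("user_id") REFERENCES "public"."userTable"("id")
    ON DELETE no action ON UPDATE no action
    NOT VALID;

-- Step 2 (ideally a separate transaction): scan and validate existing rows
-- under a weaker SHARE UPDATE EXCLUSIVE lock that permits concurrent reads/writes.
ALTER TABLE "videoTable" VALIDATE CONSTRAINT "videoTable_user_id_userTable_id_fk";
```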

In `@server/src/models/video.model.ts`:
- Line 12: The new foreign key on videoTable (userId referencing userTable.id)
lacks an explicit onDelete setting so Drizzle defaults to NO ACTION; update the
column definition for userId to include references(() => userTable.id, {
onDelete: '<choice>' }) with your chosen behavior: use 'no action' to make the
current behavior explicit, 'cascade' to delete videos when a user is removed, or
'set null' (and remove .notNull() from userId) to orphan videos on user
deletion; also ensure existing videoTable.user_id values reference userTable
rows before running the migration to avoid failures.
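A sketch of the explicit onDelete choice in the Drizzle column definition. The column type (text) and 'cascade' choice are illustrative; the surrounding videoTable definition is assumed:

```typescript
// Excerpt from the videoTable definition — making the FK behavior explicit.
// 'cascade' deletes a user's videos with the account; use 'no action' to keep
// the current behavior, or 'set null' (dropping .notNull()) to orphan videos.
userId: text('user_id')
    .notNull()
    .references(() => userTable.id, { onDelete: 'cascade' }),
```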

In `@server/src/services/transcribe.service.ts`:
- Around line 119-137: The uploadTranscript function currently calls
s3Client.send with Bucket set from process.env.S3_BUCKET_NAME which may be
undefined; add a validation that ensures S3_BUCKET_NAME is present before using
it (preferably at module initialization/startup) and throw a descriptive
Error('S3_BUCKET_NAME is not set') if missing; update the uploadTranscript
function (referencing uploadTranscript, s3Client, and PutObjectCommand) to use
the validated bucket value or rely on the early fail so that no call is made to
the AWS SDK with an undefined Bucket.
- Around line 93-114: The current spawn usage (spawn(PYTHON_BIN, …) -> proc)
lacks an 'error' listener and tries to JSON.parse the entire stdout, which can
hang on synchronous spawn failures and break if libs emit noise to stdout; add a
proc.on('error', (err) => reject(err)) handler to ensure the Promise rejects on
spawn/exec errors, and change the parsing logic in the proc.on('close', ...)
handler to locate the last non-empty line of stdout (trim, split by newline,
take last non-empty) and JSON.parse that line instead of the whole stdout; keep
the existing stderr aggregation and include stderr/last-stdout-line in
parse/reject error messages for debugging (references: spawn, proc, PYTHON_BIN,
stdout, stderr, proc.on('close')).
- Around line 42-53: The current download loads the whole MP4 into memory via
axios.get(..., { responseType: 'arraybuffer' }) and then writes it with
fs.writeFileSync, which can OOM; change the flow in the function that calls
getDownloadUrl/signedUrl to request the file as a stream (axios.get(signedUrl, {
responseType: 'stream' })) and pipe the response.data into a
fs.createWriteStream(videoPath) (ensure downloadsDir exists as you already do),
await the stream's 'finish'/'error' with a Promise before returning videoPath,
and remove the arraybuffer + writeFileSync usage so the file is streamed
directly to disk.
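The stdout-parsing fix above can be isolated into a small pure helper: take only the last non-empty line of the subprocess output and parse that, so progress messages or library noise on stdout cannot break the transcript parse. The helper name is illustrative; in runWhisper it would replace parsing the whole stdout inside the 'close' handler, alongside a proc.on('error', reject) listener:

```typescript
// Extracts and parses the last non-empty line of a subprocess's stdout.
// The Whisper script prints its JSON result last; anything before it
// (warnings, progress) is ignored. stderr is carried into error messages
// for debugging.
function parseLastJsonLine(stdout: string, stderr = ""): unknown {
    const lines = stdout
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0);
    const last = lines[lines.length - 1];
    if (!last) {
        throw new Error(`empty stdout from whisper subprocess; stderr: ${stderr}`);
    }
    try {
        return JSON.parse(last);
    } catch {
        throw new Error(`unparseable whisper output: ${last.slice(0, 200)}; stderr: ${stderr}`);
    }
}
```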

In `@server/src/workers/hls.worker.ts`:
- Around line 68-74: The current hlsWorker.on("failed", ...) handler updates
videoTable.status to 'failed' on every attempt; change it to only mark final
failure after retries are exhausted by checking the job's attempts versus
configured attempts (use job.attemptsMade and job.opts?.attempts or similar) and
only set status: 'failed' when attemptsMade >= job.opts.attempts; otherwise only
update hlsStatus to 'failed' (or leave status unchanged). Update the handler
referenced as hlsWorker.on("failed", async (job, err) => { ... }) and ensure the
DB update uses that conditional logic so transient failures don't permanently
flip videoTable.status.
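The retry-aware check can be expressed as one small predicate shared by the hls, thumbnail, and transcode failure handlers (the name isFinalFailure is illustrative):

```typescript
// True only when a failed BullMQ job has exhausted its configured attempts.
// attemptsMade counts attempts already made, including the one that just
// failed; when retries are not configured, BullMQ's attempts defaults to 1.
function isFinalFailure(attemptsMade: number, configuredAttempts?: number): boolean {
    return attemptsMade >= (configuredAttempts ?? 1);
}
```

In the handler this becomes: always set hlsStatus to 'failed', but only flip the overall videoTable.status when isFinalFailure(job.attemptsMade, job.opts?.attempts) is true.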

In `@server/src/workers/thumbnail.worker.ts`:
- Around line 42-48: The failed-event handler thumbnailWorker.on("failed", ...)
is prematurely setting videoTable.status = 'failed' on every failed attempt;
change it to only mark the DB as failed when the job has exhausted retries by
checking job.attemptsMade >= (job.opts.attempts ?? 1) (same pattern used in
transcode.worker.ts). Wrap the existing db.update(...) and
thumbnailStatus/status writes inside that conditional, keeping the console.log
for visibility but avoid updating the video row until the attempts check passes,
and reference the job, job.attemptsMade, job.opts.attempts,
thumbnailWorker.on("failed") and db.update(videoTable) to locate and modify the
code.

In `@server/src/workers/transcode.worker.ts`:
- Around line 98-100: The transcribe job (and similarly hlsQueue.add and
thumbnailQueue.add) can be enqueued multiple times on retries because no
deterministic jobId is provided; update the calls to
transcribeQueue.add("transcribe", { fileId, userId }) to include a deterministic
jobId (e.g. `jobId: \`transcribe:${fileId}\``) so BullMQ will dedupe retries,
and apply the same pattern to hlsQueue.add and thumbnailQueue.add (e.g.
`hls:${fileId}`, `thumbnail:${fileId}`); ensure you pass the jobId option in the
same call site in transcode.worker.ts and keep any existing job data
(fileId/userId) unchanged.

In `@server/src/workers/transcribe.queue.ts`:
- Around line 12-14: The queue options currently set removeOnFail: false which
retains every failed transcription job indefinitely; update the transcribe queue
options (the removeOnFail setting in the queue configuration in
server/src/workers/transcribe.queue.ts) to a bounded retention instead — e.g.,
set removeOnFail to a number (keep last N failed jobs) or to an object with
retention rules (like { count: 100 } or { age: 7 * 24 * 3600 }) so failed jobs
are automatically removed after the bound; keep the symbol name removeOnFail and
adjust only its value in the queue options block.
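A sketch of bounded failure retention in the queue's default job options. The attempts/backoff values follow the walkthrough (3 retries, exponential backoff); the 5-second base delay and the retention bounds (the review's examples) are assumptions, not tuned values:

```typescript
// Default job options for the transcribe queue: keep failed jobs for
// debugging, but bound retention so Redis doesn't accumulate them forever —
// drop failures older than 7 days or beyond the most recent 100.
const transcribeJobOptions = {
    attempts: 3,
    backoff: { type: "exponential", delay: 5000 },
    removeOnComplete: true,
    removeOnFail: { age: 7 * 24 * 3600, count: 100 },
};
```

BullMQ accepts either a number (keep last N) or an { age, count } object for removeOnFail; the object form bounds both dimensions at once.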

In `@server/src/workers/transcribe.worker.ts`:
- Around line 14-49: The worker's pipeline (transcribeWorker) can throw before
cleanupTranscribeFiles(videoPath, audioPath) is called, leaking temp files;
refactor the job handler to declare videoPath and audioPath in the outer scope,
wrap the main steps (downloadOriginalForTranscription, extractAudio, runWhisper,
uploadTranscript and DB updates) in a try block and call
cleanupTranscribeFiles(videoPath, audioPath) in a finally block so files are
always removed even on errors; keep the existing failed handler for DB status
updates but do not rely on it for file cleanup.
- Around line 55-60: The failed event handler for transcribeWorker can throw
when job or job.data is undefined; update the handler in
transcribeWorker.on("failed", ...) to guard access to job and job.data.fileId:
first check whether job is present and extract fileId with a safe check (e.g.,
job && job.data && job.data.fileId), log a clear message including err and
job?.id when job is missing or deserialization/stall occurred, and only call
db.update(videoTable).set(...).where(eq(videoTable.id, fileId)) when fileId is
defined; if fileId is missing, log a warning/error instead of issuing an update
so you don't execute eq(videoTable.id, undefined).
- Around line 47-49: The Worker instantiation is using an unnecessary type
assertion "as any" on the redis connection; remove the cast so the real IORedis
instance is passed to the Worker options (replace connection: redis as any with
connection: redis) to preserve type information and let the compiler validate
the IORedis type used by the Worker constructor.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e67d5e68-1cb8-41c6-8bdc-21be632611d2

📥 Commits

Reviewing files that changed from the base of the PR and between 970579d and 26b3b08.

📒 Files selected for processing (14)
  • docker-compose.yml
  • server/Dockerfile
  • server/drizzle/0001_glorious_shooting_star.sql
  • server/drizzle/meta/0001_snapshot.json
  • server/drizzle/meta/_journal.json
  • server/src/ai-worker.ts
  • server/src/models/video.model.ts
  • server/src/services/transcode.service.ts
  • server/src/services/transcribe.service.ts
  • server/src/workers/hls.worker.ts
  • server/src/workers/thumbnail.worker.ts
  • server/src/workers/transcode.worker.ts
  • server/src/workers/transcribe.queue.ts
  • server/src/workers/transcribe.worker.ts

Comment thread docker-compose.yml
Comment thread server/Dockerfile
Comment thread server/Dockerfile
Comment on lines +1 to +2
ALTER TABLE "videoTable" ADD COLUMN "transcript_key" text;--> statement-breakpoint
ALTER TABLE "videoTable" ADD CONSTRAINT "videoTable_user_id_userTable_id_fk" FOREIGN KEY ("user_id") REFERENCES "public"."userTable"("id") ON DELETE no action ON UPDATE no action;


🧹 Nitpick | 🔵 Trivial

Confirm FK addition strategy and delete semantics.

Two small things worth confirming before this hits a populated environment:

  1. ADD CONSTRAINT ... FOREIGN KEY ... REFERENCES ... validates every existing row and takes an ACCESS EXCLUSIVE lock on videoTable. On a large table this blocks reads/writes. If non-trivial data already exists in the target environment, prefer a two-step ADD CONSTRAINT ... NOT VALID; followed by VALIDATE CONSTRAINT ...;.
  2. ON DELETE no action means deleting a userTable row with any videos will raise a referential-integrity error instead of cascading. If user deletion is a real flow (GDPR/account deletion), you probably want ON DELETE CASCADE (or an explicit soft-delete/archival path). Please confirm this is the intended semantic.

Comment thread server/src/models/video.model.ts
Comment on lines +98 to +100
// Dispatch transcription job — picked up by the ai-worker container
await transcribeQueue.add("transcribe", { fileId, userId });
console.log(`🎙️ Transcription job queued for fileId: ${fileId}`);


⚠️ Potential issue | 🟡 Minor

Transcribe job can be enqueued multiple times on transcode retries — use a deterministic jobId.

The worker body is not idempotent across retries: if anything between line 75 (thumbnailQueue.add) and line 103 (the final DB status update) throws, BullMQ will retry the whole handler and this transcribeQueue.add("transcribe", …) will fire again, causing duplicate Whisper runs (expensive) and a double transcriptKey write in videoTable. The same risk exists for the hlsQueue.add at L95 and thumbnailQueue.add at L75, but it's especially painful for transcription.

Add a deterministic jobId so BullMQ deduplicates:

🛠️ Proposed fix
-    // Dispatch transcription job — picked up by the ai-worker container
-    await transcribeQueue.add("transcribe", { fileId, userId });
+    // Dispatch transcription job — picked up by the ai-worker container.
+    // jobId ensures idempotency across transcode retries.
+    await transcribeQueue.add(
+        "transcribe",
+        { fileId, userId },
+        { jobId: `transcribe:${fileId}` }
+    );

Consider the same treatment for hlsQueue.add (jobId: `hls:${fileId}`) and thumbnailQueue.add (jobId: `thumbnail:${fileId}`).


Comment thread server/src/workers/transcribe.queue.ts
Comment on lines +14 to +49
export const transcribeWorker = new Worker("transcribeQueue", async (job: Job) => {
    const { fileId, userId } = job.data as { fileId: string; userId: string };

    console.log(`\n🎙️ Transcription job ${job.id} started for fileId: ${fileId}`);

    // Mark as processing
    await db.update(videoTable).set({
        transcriptStatus: 'processing'
    }).where(eq(videoTable.id, fileId));

    // 1. Download original video from S3
    const videoPath = await downloadOriginalForTranscription(fileId, userId, job.id!);

    // 2. Extract audio (16kHz mono WAV — Whisper's preferred format)
    const audioPath = await extractAudio(videoPath);

    // 3. Run faster-whisper
    console.log(`⏳ Running Whisper on ${audioPath} (model: ${process.env.WHISPER_MODEL ?? 'base'})...`);
    const transcript = await runWhisper(audioPath);
    console.log(`✅ Transcription complete — language: ${transcript.language}, duration: ${transcript.duration}s, segments: ${transcript.segments.length}`);

    // 4. Upload transcript JSON to S3
    const transcriptKey = await uploadTranscript(transcript, fileId, userId);

    // 5. Persist the S3 key in the DB
    await db.update(videoTable).set({
        transcriptKey,
        transcriptStatus: 'completed',
    }).where(eq(videoTable.id, fileId));

    // 6. Clean up local temp files
    cleanupTranscribeFiles(videoPath, audioPath);

}, {
    connection: redis as any,
});


⚠️ Potential issue | 🔴 Critical

Temp files leak on any pipeline failure — wrap in try/finally.

If downloadOriginalForTranscription, extractAudio, runWhisper, or uploadTranscript rejects, execution returns immediately (the error propagates to BullMQ), and cleanupTranscribeFiles(videoPath, audioPath) on line 45 is never reached. The downloaded MP4 and extracted WAV remain in /app/downloads/transcribe/ forever. With a 2 GB memory-capped container and retried jobs, this will fill the container disk quickly and wedge the worker.

The failed handler on lines 55–60 only updates the DB — it cannot clean up temp files because videoPath / audioPath are out of scope there.

🛡️ Proposed fix — ensure cleanup always runs
 export const transcribeWorker = new Worker("transcribeQueue", async (job: Job) => {
     const { fileId, userId } = job.data as { fileId: string; userId: string };

     console.log(`\n🎙️  Transcription job ${job.id} started for fileId: ${fileId}`);

     // Mark as processing
     await db.update(videoTable).set({
         transcriptStatus: 'processing'
     }).where(eq(videoTable.id, fileId));

-    // 1. Download original video from S3
-    const videoPath = await downloadOriginalForTranscription(fileId, userId, job.id!);
-
-    // 2. Extract audio (16kHz mono WAV — Whisper's preferred format)
-    const audioPath = await extractAudio(videoPath);
-
-    // 3. Run faster-whisper
-    console.log(`⏳ Running Whisper on ${audioPath} (model: ${process.env.WHISPER_MODEL ?? 'base'})...`);
-    const transcript = await runWhisper(audioPath);
-    console.log(`✅ Transcription complete — language: ${transcript.language}, duration: ${transcript.duration}s, segments: ${transcript.segments.length}`);
-
-    // 4. Upload transcript JSON to S3
-    const transcriptKey = await uploadTranscript(transcript, fileId, userId);
-
-    // 5. Persist the S3 key in the DB
-    await db.update(videoTable).set({
-        transcriptKey,
-        transcriptStatus: 'completed',
-    }).where(eq(videoTable.id, fileId));
-
-    // 6. Clean up local temp files
-    cleanupTranscribeFiles(videoPath, audioPath);
-
+    let videoPath: string | undefined;
+    let audioPath: string | undefined;
+    try {
+        videoPath = await downloadOriginalForTranscription(fileId, userId, job.id!);
+        audioPath = await extractAudio(videoPath);
+
+        console.log(`⏳ Running Whisper on ${audioPath} (model: ${process.env.WHISPER_MODEL ?? 'base'})...`);
+        const transcript = await runWhisper(audioPath);
+        console.log(`✅ Transcription complete — language: ${transcript.language}, duration: ${transcript.duration}s, segments: ${transcript.segments.length}`);
+
+        const transcriptKey = await uploadTranscript(transcript, fileId, userId);
+
+        await db.update(videoTable).set({
+            transcriptKey,
+            transcriptStatus: 'completed',
+        }).where(eq(videoTable.id, fileId));
+    } finally {
+        if (videoPath || audioPath) {
+            cleanupTranscribeFiles(videoPath ?? '', audioPath ?? '');
+        }
+    }
 }, {
     connection: redis as any,
 });

Comment on lines +47 to +49
}, {
    connection: redis as any,
});


🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm the shape of the exported redis client
fd -t f 'redis.ts' -p 'config'
rg -nP --type=ts -C3 'export\s+(const|default)\s+redis\b'

Repository: vector17002/video-transcoding

Length of output: 366


🏁 Script executed:

cat -n server/src/config/redis.ts

Repository: vector17002/video-transcoding

Length of output: 314


🏁 Script executed:

sed -n '40,60p' server/src/workers/transcribe.worker.ts

Repository: vector17002/video-transcoding

Length of output: 687


🏁 Script executed:

rg -A5 -B5 'new Queue' server/src/workers/transcribe.worker.ts

Repository: vector17002/video-transcoding

Length of output: 55


🏁 Script executed:

head -50 server/src/workers/transcribe.worker.ts

Repository: vector17002/video-transcoding

Length of output: 1881


🏁 Script executed:

rg -n 'import.*Queue|new Worker|new Queue' server/src/workers/transcribe.worker.ts -A2

Repository: vector17002/video-transcoding

Length of output: 246


🏁 Script executed:

rg -n 'ConnectionOptions|connection:' server/src -t ts -B2 -A2

Repository: vector17002/video-transcoding

Length of output: 2470


🏁 Script executed:

# Check package.json to see BullMQ version
rg '"bullmq"' server/package.json

Repository: vector17002/video-transcoding

Length of output: 98


🏁 Script executed:

rg 'new Worker|new Queue' server/src -t ts -A3 | head -50

Repository: vector17002/video-transcoding

Length of output: 2688


🏁 Script executed:

# Check if there's a tsconfig or tsconfig.json that shows strict checking enabled
cat tsconfig.json | head -20

Repository: vector17002/video-transcoding

Length of output: 119


🏁 Script executed:

# Check BullMQ type definitions for what it accepts
rg -i 'connection.*redis|ConnectionOptions' server/src/workers -t ts | head -20

Repository: vector17002/video-transcoding

Length of output: 611


🏁 Script executed:

# Check if there's a server-level tsconfig
find server -name "tsconfig*.json" -type f

Repository: vector17002/video-transcoding

Length of output: 94


🏁 Script executed:

# Look for any imports of ConnectionOptions or type definitions used with connection option
rg -n 'ConnectionOptions|WorkerOptions|QueueOptions' server/src -t ts

Repository: vector17002/video-transcoding

Length of output: 55


🏁 Script executed:

cat server/tsconfig.json

Repository: vector17002/video-transcoding

Length of output: 1215


🏁 Script executed:

# Check if there are any node_modules or type stubs that would show BullMQ's expected types
# Look for any comments or documentation about the connection parameter
rg -n 'BullMQ|Worker|Queue' server/src/workers/transcribe.worker.ts -B5 | head -30

Repository: vector17002/video-transcoding

Length of output: 646


🏁 Script executed:

# Search for any TypeScript error suppressions or type assertions related to redis/connection
rg -n '@ts-' server/src/workers -t ts | grep -i redis

Repository: vector17002/video-transcoding

Length of output: 55


Remove the as any cast on the redis connection.

The redis import is already an IORedis instance (instantiated in server/src/config/redis.ts), which BullMQ's Worker accepts directly for the connection option. The as any assertion is unnecessary and masks type information that the type checker should verify.

Code context
}, {
    connection: redis as any,
});

Change to:

}, {
    connection: redis,
});

Comment on lines +55 to +60
transcribeWorker.on("failed", async (job, err) => {
    console.error(`❌ Transcription job ${job?.id} failed: ${err.message}`);
    await db.update(videoTable).set({
        transcriptStatus: 'failed',
    }).where(eq(videoTable.id, job?.data.fileId));
});


⚠️ Potential issue | 🔴 Critical

failed handler can itself crash — NPE on job?.data.fileId.

BullMQ does not guarantee job is defined in the failed event (e.g. stalled jobs that were removed, or deserialization errors). When job is undefined, job?.data is undefined, and then .fileId throws TypeError: Cannot read properties of undefined (reading 'fileId'), which escapes the handler and is logged as an unhandled rejection — masking the original failure.

Secondarily, if fileId ends up undefined, eq(videoTable.id, undefined) will execute an UPDATE with a bound null/undefined that matches no rows — silently succeeding but not doing what you think.

🛡️ Proposed fix
 transcribeWorker.on("failed", async (job, err) => {
     console.error(`❌ Transcription job ${job?.id} failed: ${err.message}`);
-    await db.update(videoTable).set({
-        transcriptStatus: 'failed',
-    }).where(eq(videoTable.id, job?.data.fileId));
+    const fileId = (job?.data as { fileId?: string } | undefined)?.fileId;
+    if (!fileId) {
+        console.warn(`⚠️  failed handler: missing fileId on job ${job?.id}; skipping DB update`);
+        return;
+    }
+    await db.update(videoTable).set({
+        transcriptStatus: 'failed',
+    }).where(eq(videoTable.id, fileId));
 });

@vector17002 vector17002 merged commit f702774 into main Apr 25, 2026
1 check passed
@coderabbitai coderabbitai Bot mentioned this pull request Apr 25, 2026
