### **RAG-BASED TEACHING AI ASSISTANT**


**🧠 Cell 1 Description:**

This cell sets up the environment for offline speech recognition with the Vosk library. It installs the Python packages vosk, soundfile, pydub, tqdm, ffmpeg-python, and sentence-transformers, and makes sure the ffmpeg binary is available, so that audio files can be processed and transcribed locally without relying on online APIs.
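Before moving on, it can help to confirm that the installed tools are visible to the runtime. The sketch below is not part of the original notebook: `check_environment` is a hypothetical helper that reports which Python modules can be located and whether an `ffmpeg` executable is on the PATH.

```python
import importlib.util
import shutil

def check_environment(modules=("vosk", "soundfile", "pydub", "tqdm"), binaries=("ffmpeg",)):
    """Return {name: bool} for each locatable module and each binary found on PATH."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        report[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        report[name] = shutil.which(name) is not None
    return report

if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Any entry reported as `MISSING` suggests the install step above should be re-run before continuing.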

In [None]:
# ✅ Cell 1: Install and configure Vosk (offline speech recognition)


# Upgrade pip and install required packages
!pip install -q --upgrade pip

# Install Vosk (offline speech-to-text) and helpers
!pip install -q vosk soundfile pydub tqdm ffmpeg-python

# Ensure ffmpeg is available
!which ffmpeg || (apt-get update -qq && apt-get install -y -qq ffmpeg)

!pip install -q sentence-transformers

print("\n✅ Installed Vosk + dependencies successfully.")
print("You can now run the next cell to download a model and transcribe audio files.")

/usr/bin/ffmpeg

✅ Installed Vosk + dependencies successfully.
You can now run the next cell to download a model and transcribe audio fil

**📂 Cell 2 Description:**

This cell mounts your Google Drive and automatically imports a project folder from a shared Drive URL or folder ID into the Colab environment. It authenticates with Google, retrieves files (including Google Docs/Sheets via export), and downloads them recursively into a local directory (/content/RAG-Based-AI-Teaching-Assistant). This ensures your entire project is accessible within Colab for further processing or execution.
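Google-native files have no binary content and have to be exported to a concrete format. The cell below tries plain text first and falls back to PDF. As an alternative sketch, the export format could be chosen per file type. The mapping and the helper name `pick_export_format` are assumptions for illustration, not part of the notebook, and the PDF fallback may still fail for native types that cannot be exported.

```python
# Assumed mapping from Google-native MIME types to (export MIME type, file extension).
EXPORT_FORMATS = {
    "application/vnd.google-apps.document": ("text/plain", ".txt"),
    "application/vnd.google-apps.spreadsheet": ("text/csv", ".csv"),
    "application/vnd.google-apps.presentation": ("application/pdf", ".pdf"),
}

def pick_export_format(mime_type):
    """Return (export_mime, extension) for a Google-native file, else None."""
    if not mime_type.startswith("application/vnd.google-apps"):
        return None  # regular file: download its bytes directly instead
    if mime_type == "application/vnd.google-apps.folder":
        return None  # folders are recursed into, not exported
    # Other native types fall back to PDF as a guess
    return EXPORT_FORMATS.get(mime_type, ("application/pdf", ".pdf"))
```

With such a helper, a spreadsheet would be requested as CSV right away instead of first failing a plain-text export.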

In [None]:
# Cell 2 (updated) - Mount Drive and import project from a Drive folder URL or ID
from google.colab import drive, auth
import os, re, io
from pathlib import Path
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

# 1) Mount Drive
drive.mount('/content/drive', force_remount=True)

# 2) Helper: extract folder id from a Drive URL or accept an ID
def extract_folder_id(s: str):
    # common URL forms:
    # https://drive.google.com/drive/folders/<ID>?...
    # https://drive.google.com/drive/u/0/folders/<ID>
    m = re.search(r'folders/([a-zA-Z0-9_\-]+)', s)
    if m:
        return m.group(1)
    # if user pasted only the id
    if re.fullmatch(r'[a-zA-Z0-9_\-]+', s):
        return s
    return None

folder_input = "https://drive.google.com/drive/folders/16D3GyqZNQIVgtyVOHVgatWLg36aepm8e?usp=drive_link"

FOLDER_ID = extract_folder_id(folder_input)
if not FOLDER_ID:
    raise SystemExit("Couldn't parse a folder ID from folder_input. Put the full folder URL or the folder ID in folder_input variable.")

print("Parsed folder id:", FOLDER_ID)

# 3) Try to find the folder under /content/drive/MyDrive (if user added shortcut or it's in My Drive)
def find_path_in_mydrive_by_name(folder_id):
    # The Drive FUSE mount does not expose file IDs, so a folder ID cannot be
    # mapped reliably to a path under /content/drive/MyDrive.
    # Always return None so the Drive API download below is used instead.
    return None

mydrive_path = find_path_in_mydrive_by_name(FOLDER_ID)
if mydrive_path:
    PROJECT_DIR = str(mydrive_path)
    print("Found folder in My Drive at:", PROJECT_DIR)
else:
    print("Folder not found in My Drive FUSE path — will use Drive API to copy files locally.")

    # 4) Authenticate and build Drive API client
    auth.authenticate_user()
    drive_service = build('drive', 'v3')

    # 5) Create local target dir
    LOCAL_PROJECT_DIR = "/content/RAG-Based-AI-Teaching-Assistant"
    Path(LOCAL_PROJECT_DIR).mkdir(parents=True, exist_ok=True)

    # 6) Recursively download folder contents from Drive folder id -> LOCAL_PROJECT_DIR
    def download_file(file_id, dest_path, mimeType=None):
        """Download a regular (binary) file to dest_path."""
        request = drive_service.files().get_media(fileId=file_id)
        # The context manager closes the file even if a chunk download fails.
        with io.FileIO(dest_path, mode='wb') as fh:
            downloader = MediaIoBaseDownload(fh, request)
            done = False
            while not done:
                status, done = downloader.next_chunk()
                # optional: print(f"Download {dest_path}: {int(status.progress()*100)}%")

    def export_google_doc(file_id, dest_path, mimeType='text/plain'):
        """Export a Google-native file (Docs/Sheets/Slides) to the given MIME type."""
        request = drive_service.files().export_media(fileId=file_id, mimeType=mimeType)
        with io.FileIO(dest_path, mode='wb') as fh:
            downloader = MediaIoBaseDownload(fh, request)
            done = False
            while not done:
                status, done = downloader.next_chunk()

    def list_children(folder_id):
        files = []
        page_token = None
        query = f"'{folder_id}' in parents and trashed = false"
        while True:
            res = drive_service.files().list(q=query,
                                            spaces='drive',
                                            fields='nextPageToken, files(id, name, mimeType)',
                                            pageToken=page_token).execute()
            items = res.get('files', [])
            files.extend(items)
            page_token = res.get('nextPageToken', None)
            if not page_token:
                break
        return files

    def download_folder_recursive(folder_id, dest_dir):
        items = list_children(folder_id)
        for it in items:
            fid = it['id']
            name = it['name']
            mime = it.get('mimeType', '')
            safe_name = name.replace('/', '_')
            dest = os.path.join(dest_dir, safe_name)
            if mime == 'application/vnd.google-apps.folder':
                os.makedirs(dest, exist_ok=True)
                print("Creating folder:", dest)
                download_folder_recursive(fid, dest)
            elif mime.startswith('application/vnd.google-apps'):
                # Google-native file (Docs/Sheets/Slides). Try exporting as plain text or PDF.
                print("Exporting Google native file:", name, "->", dest + ".txt")
                try:
                    export_google_doc(fid, dest + ".txt", mimeType='text/plain')
                except Exception as e:
                    print("  export as text failed, trying pdf:", e)
                    try:
                        export_google_doc(fid, dest + ".pdf", mimeType='application/pdf')
                    except Exception as e2:
                        print("  export failed:", e2)
            else:
                # Regular file -> download
                print("Downloading file:", name, "->", dest)
                try:
                    download_file(fid, dest, mimeType=mime)
                except Exception as e:
                    print("  download error:", e)

    print("Downloading contents of Drive folder to:", LOCAL_PROJECT_DIR)
    download_folder_recursive(FOLDER_ID, LOCAL_PROJECT_DIR)
    PROJECT_DIR = LOCAL_PROJECT_DIR

# 7) Change working directory to project dir & list files
print("Using PROJECT_DIR =", PROJECT_DIR)
os.chdir(PROJECT_DIR)
print("Current working directory:")
!pwd
print("\nFiles in project directory (top-level):")
!ls -la | sed -n '1,120p'


Mounted at /content/drive
Parsed folder id: 16D3GyqZNQIVgtyVOHVgatWLg36aepm8e
Folder not found in My Drive FUSE path — will use Drive API to copy files locally.
Downloading contents of Drive folder to: /content/RAG-Based-AI-Teaching-Assistant
Creating folder: /content/RAG-Based-AI-Teaching-Assistant/.git
Creating folder: /content/RAG-Based-AI-Teaching-Assistant/.git/logs
Creating folder: /content/RAG-Based-AI-Teaching-Assistant/.git/logs/refs
Creating folder: /content/RAG-Based-AI-Teaching-Assistant/.git/logs/refs/remotes
Creating folder: /content/RAG-Based-AI-Teaching-Assistant/.git/logs/refs/remotes/origin
Downloading file: HEAD -> /content/RAG-Based-AI-Teaching-Assistant/.git/logs/refs/remotes/origin/HEAD
Creating folder: /content/RAG-Based-AI-Teaching-Assistant/.git/logs/refs/heads
Downloading file: main -> /content/RAG-Based-AI-Teaching-Assistant/.git/logs/refs/heads/main
Downloading file: HEAD -> /content/RAG-Based-AI-Teaching-Assistant/.git/logs/HEAD
Creating folder: /content/RA

**📁 Cell 3 Description:**

This cell ensures that the required project directories — videos, audios, and jsons — exist in the current workspace. It then lists the contents of these folders along with the root directory, helping you verify that all necessary files are correctly organized before proceeding with audio extraction or transcription steps.

In [None]:
# Cell 3: ensure expected folders exist and list key files
import os
os.makedirs("videos", exist_ok=True)
os.makedirs("audios", exist_ok=True)
os.makedirs("jsons", exist_ok=True)

print("videos folder listing:")
!ls -la videos || true
print("\naudios folder listing:")
!ls -la audios || true
print("\nroot files:")
!ls -la


videos folder listing:
total 8
drwxr-xr-x 2 root root 4096 Nov  1 08:32 .
drwxr-xr-x 7 root root 4096 Nov  1 08:32 ..

audios folder listing:
total 8
drwxr-xr-x 2 root root 4096 Nov  1 08:32 .
drwxr-xr-x 7 root root 4096 Nov  1 08:32 ..

root files:
total 52
drwxr-xr-x 7 root root 4096 Nov  1 08:32 .
drwxr-xr-x 1 root root 4096 Nov  1 08:18 ..
drwxr-xr-x 2 root root 4096 Nov  1 08:32 audios
drwxr-xr-x 7 root root 4096 Nov  1 08:19 .git
drwxr-xr-x 2 root root 4096 Nov  1 08:32 jsons
-rw-r--r-- 1 root root  925 Nov  1 08:31 mp3_to_json.py
-rw-r--r-- 1 root root 1045 Nov  1 08:31 preprocess_json.py
-rw-r--r-- 1 root root 2315 Nov  1 08:31 process_incoming.py
-rw-r--r-- 1 root root 2992 Nov  1 08:31 README.md
drwxr-xr-x 4 root root 4096 Nov  1 08:31 .venv
drwxr-xr-x 2 root root 4096 Nov  1 08:32 videos
-rw-r--r-- 1 root root  345 Nov  1 08:31 video_to_mp3.py


**🎥 Cell 4 Description:**

This cell converts every video file in the videos/ folder into an MP3 audio file using FFmpeg. It sanitizes each file name (spaces become underscores; square brackets and # are removed) and saves the resulting audio into the audios/ directory. This step prepares your video lectures or tutorials for offline transcription with the Vosk speech recognition model in later cells.
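The same FFmpeg invocation can also be built as an argument list, which avoids shell quoting altogether. This is a sketch rather than the notebook's code: `safe_stem` and `build_ffmpeg_cmd` are hypothetical helper names, while the flags mirror the ones used in the cell below.

```python
from pathlib import Path

def safe_stem(video_path):
    """Mirror the notebook's file-name cleanup: spaces -> underscores, drop [, ] and #."""
    stem = Path(video_path).stem
    return stem.replace(" ", "_").replace("[", "").replace("]", "").replace("#", "")

def build_ffmpeg_cmd(video_path, out_dir="audios"):
    """Return (argv, output_path) for extracting an MP3 track from one video."""
    out_path = str(Path(out_dir) / f"{safe_stem(video_path)}.mp3")
    argv = [
        "ffmpeg", "-y",            # overwrite an existing output file
        "-i", str(video_path),     # input video
        "-vn",                     # drop the video stream
        "-acodec", "libmp3lame",   # encode audio as MP3
        "-q:a", "2",               # variable bit rate, quality level 2
        out_path,
    ]
    return argv, out_path
```

The list can be passed straight to `subprocess.run(argv, check=True)` without `shell=True`.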

In [None]:
# Cell 4: Convert all videos in videos/ -> audios/ using ffmpeg
import os, shlex, subprocess, pathlib

VIDEO_DIR = "videos"
AUDIO_DIR = "audios"
os.makedirs(AUDIO_DIR, exist_ok=True)

def to_mp3(vpath, outdir=AUDIO_DIR):
    p = pathlib.Path(vpath)
    safe_name = p.stem.replace(" ", "_").replace("[","").replace("]","").replace("#","")
    out = os.path.join(outdir, f"{safe_name}.mp3")
    cmd = f'ffmpeg -y -i {shlex.quote(str(vpath))} -vn -acodec libmp3lame -q:a 2 {shlex.quote(out)}'
    print("Running:", cmd)
    subprocess.run(cmd, shell=True, check=True)
    return out

videos = [os.path.join(VIDEO_DIR, f) for f in os.listdir(VIDEO_DIR) if not f.startswith(".")]
if len(videos) == 0:
    print("No videos found in", VIDEO_DIR, "- add mp4 files there (or upload).")
else:
    for v in videos:
        try:
            mp3 = to_mp3(v)
            print("Wrote:", mp3)
        except Exception as e:
            print("Error converting", v, e)


Running: ffmpeg -y -i videos/Lecture_1_Black_Hole.mp4 -vn -acodec libmp3lame -q:a 2 audios/Lecture_1_Black_Hole.mp3
Wrote: audios/Lecture_1_Black_Hole.mp3


**🗣️ Cell 5 Description:**

This cell performs automatic transcription of all .mp3 files in the audios/ folder using the Vosk offline speech recognition model.
It first ensures the Vosk model is available (and downloads it if missing), then converts each MP3 to a 16 kHz mono WAV (the required input format). The audio is processed in chunks, and the recognized text is stored in structured JSON files under the jsons/ directory.
Each JSON file contains the full transcript, word-level timing data, and model details — forming the foundation for later embedding or retrieval-based analysis.
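With word output enabled, each Vosk result is a dict with a `text` string and a `result` list of word entries carrying `word`, `start`, `end`, and `conf`. The results themselves have no top-level start or end time, so segment boundaries have to be derived from the first and last word. The helpers below are a hypothetical sketch of that step and are not part of the notebook.

```python
def segment_bounds(segment):
    """Return (start, end) in seconds for one Vosk result dict, or (None, None) if it has no words."""
    words = segment.get("result") or []
    if not words:
        return None, None
    return words[0]["start"], words[-1]["end"]

def add_bounds(segments):
    """Copy each non-empty segment and attach 'start'/'end' derived from its word timings."""
    out = []
    for seg in segments:
        if not seg.get("text"):
            continue  # skip empty results, e.g. stretches of silence
        start, end = segment_bounds(seg)
        out.append({**seg, "start": start, "end": end})
    return out
```

Applying `add_bounds` to the `segments` list before saving would give later steps a timestamp for every chunk.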

In [None]:
# ✅ Cell 5: Transcribe all mp3 files in audios/ → jsons/ using Vosk
import os, json, pathlib, subprocess, shlex
from pathlib import Path
from tqdm import tqdm

# Folders
AUDIO_DIR = "audios"
JSON_DIR = "jsons"
MODEL_DIR = "vosk_model"

os.makedirs(JSON_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

# Attempt to download a small English model if not present
MODEL_PATH = Path(MODEL_DIR)
if not any(MODEL_PATH.iterdir()):
    print("No Vosk model found, downloading small English model (if internet available)...")
    try:
        model_url = "https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip"
        zip_path = "/tmp/vosk_model.zip"
        subprocess.run(f"wget -q -O {zip_path} {model_url}", shell=True, check=True)
        subprocess.run(f"unzip -q {zip_path} -d {MODEL_DIR}", shell=True, check=True)
        # Move contents up one level if nested
        inner = next(Path(MODEL_DIR).glob("vosk-model-*"), None)
        if inner:
            for f in inner.iterdir():
                subprocess.run(f"mv {f} {MODEL_DIR}/", shell=True)
            subprocess.run(f"rm -rf {inner}", shell=True)
        print("✅ Model downloaded and extracted to", MODEL_DIR)
    except Exception as e:
        print("⚠️ Could not download Vosk model automatically:", e)
        print("→ Please manually upload a model folder into", MODEL_DIR)

# Import vosk
from vosk import Model, KaldiRecognizer
import wave

# Load model
print("Loading Vosk model from:", MODEL_DIR)
model = Model(MODEL_DIR)

# Helper: convert mp3 → 16 kHz mono WAV (Vosk requires WAV)
def mp3_to_wav_16k(src_mp3, dst_wav):
    cmd = f'ffmpeg -y -i {shlex.quote(src_mp3)} -ac 1 -ar 16000 -vn {shlex.quote(dst_wav)}'
    subprocess.run(cmd, shell=True, check=True)

# Process each MP3
mp3_files = sorted(Path(AUDIO_DIR).glob("*.mp3"))
if not mp3_files:
    print("⚠️ No mp3 files found in", AUDIO_DIR)
else:
    for mp3 in tqdm(mp3_files, desc="Transcribing MP3s"):
        try:
            wav_path = Path("/tmp") / (mp3.stem + "_16k.wav")
            mp3_to_wav_16k(str(mp3), str(wav_path))

            wf = wave.open(str(wav_path), "rb")
            rec = KaldiRecognizer(model, 16000)
            rec.SetWords(True)

            results = []
            while True:
                data = wf.readframes(4000)
                if len(data) == 0:
                    break
                if rec.AcceptWaveform(data):
                    part = json.loads(rec.Result())
                    results.append(part)
            results.append(json.loads(rec.FinalResult()))
            wf.close()

            text = " ".join([r.get("text", "") for r in results]).strip()
            out = {
                "file": mp3.name,
                "text": text,
                "segments": results,
                "model": "vosk-small-en-us-0.15"
            }

            out_path = Path(JSON_DIR) / (mp3.stem + ".json")
            with open(out_path, "w", encoding="utf-8") as fh:
                json.dump(out, fh, ensure_ascii=False, indent=2)
            print("✅ Saved transcript:", out_path)
        except Exception as e:
            print("❌ Error transcribing", mp3, ":", e)


No Vosk model found, downloading small English model (if internet available)...
✅ Model downloaded and extracted to vosk_model
Loading Vosk model from: vosk_model


Transcribing MP3s: 100%|██████████| 1/1 [00:02<00:00,  2.50s/it]

✅ Saved transcript: jsons/Lecture_1_Black_Hole.json





**🧹 Cell 6 Description:**

This cell performs **basic preprocessing and cleanup** of the raw transcription JSON files generated by Vosk.
It reads each JSON file from the `jsons/` directory, removes unnecessary metadata, and keeps only the essential fields — the filename, transcribed text, segments, and language (if available).
The cleaned and lightweight versions are then saved into a new folder, `jsons/clean`, making them ready for **faster downstream processing**, such as text embedding or retrieval-based question answering.


In [None]:
# Cell 6: Minimal preprocessing fallback - creates cleaned JSONs in jsons/clean
import os, json, glob
from pathlib import Path

INPUT_DIR = "jsons"
OUT_DIR = "jsons/clean"
os.makedirs(OUT_DIR, exist_ok=True)

for f in Path(INPUT_DIR).glob("*.json"):
    try:
        # Use context managers so file handles are closed promptly
        with open(f, "r", encoding="utf-8") as src:
            j = json.load(src)
        cleaned = {
            "file": f.name,
            "text": j.get("text", ""),
            "segments": j.get("segments", []),
            "language": j.get("language", "")
        }
        out = Path(OUT_DIR) / f.name
        with open(out, "w", encoding="utf-8") as dst:
            json.dump(cleaned, dst, ensure_ascii=False, indent=2)
        print("Cleaned ->", out)
    except Exception as e:
        print("Skip", f, e)


Cleaned -> jsons/clean/Lecture_1_Black_Hole.json


🧬 **Cell 7 Description:**

This cell builds (or updates) your **embeddings dataset** from the cleaned transcripts in `jsons/clean`. It assembles a DataFrame of text chunks, **reuses any existing embeddings** from `embeddings.joblib`, computes only the **missing vectors** using `SentenceTransformer` (`all-MiniLM-L6-v2`), saves the result back to `embeddings.joblib`, and exposes it as `embeddings_df` for downstream RAG queries.
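The reuse step amounts to a cache lookup: a chunk that was already embedded keeps its vector, and only new chunks are sent to the model. The sketch below illustrates the idea with plain dictionaries keyed on `(file, text)`. The helper names are hypothetical, and `embed_fn` stands in for any function that maps a list of texts to a list of vectors.

```python
def split_by_cache(rows, cache):
    """Partition rows into (cached, missing) using (file, text) as the lookup key.

    rows  : list of dicts with 'file' and 'text'
    cache : dict mapping (file, text) -> embedding (a sequence of floats)
    """
    cached, missing = [], []
    for row in rows:
        key = (row["file"], row["text"])
        if key in cache:
            cached.append({**row, "embedding": cache[key]})
        else:
            missing.append(row)
    return cached, missing

def fill_missing(missing, embed_fn):
    """Embed only the missing rows and return them with an 'embedding' field."""
    vectors = embed_fn([row["text"] for row in missing]) if missing else []
    return [{**row, "embedding": list(vec)} for row, vec in zip(missing, vectors)]
```

Keying on the text itself, rather than on a row position, means an edited transcript is re-embedded instead of inheriting a stale vector.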


In [None]:
# Cell 7 - Ensure embeddings.joblib exists (create from jsons/clean if missing)
# Run this cell in the project root where jsons/clean and your scripts live.

# 0) Install dependencies (Colab-friendly). Comment out if already installed.
!pip install -q sentence-transformers joblib pandas tqdm

# --- Begin Python logic ---
import os, glob, json, joblib, math
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm

EMB_PATH = "embeddings.joblib"
CLEAN_JSON_DIR = "jsons/clean"   # change if your cleaned JSONs are elsewhere
MODEL_NAME = "all-MiniLM-L6-v2"  # small and fast; change if desired
BATCH_SIZE = 64                  # memory vs speed tradeoff

def load_json_chunks(jpath):
    """Return list of dicts with at least 'text' (and optionally start/end)."""
    with open(jpath, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Common structures: list of chunks OR dict with 'chunks' OR dict with 'segments'/'results'
    if isinstance(data, list):
        chunks = data
    elif isinstance(data, dict) and "chunks" in data and isinstance(data["chunks"], list):
        chunks = data["chunks"]
    elif isinstance(data, dict) and ("segments" in data or "results" in data):
        chunks = data.get("segments", []) or data.get("results", [])
    else:
        # fallback: treat top-level dict as a single chunk if it has 'text' or 'transcript'
        if isinstance(data, dict) and ("text" in data or "transcript" in data):
            chunks = [data]
        else:
            chunks = []
    # Normalize each chunk to a dict with text/start/end
    out = []
    for c in chunks:
        if not isinstance(c, dict):
            continue
        text = c.get("text") or c.get("transcript") or ""
        text = text.strip()
        if not text:
            continue
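        # Take the first key that is present and not None; a start of 0.0 is valid, so avoid `or`.
        start = next((c[k] for k in ("start", "start_time", "start_sec") if c.get(k) is not None), None)
        end   = next((c[k] for k in ("end", "end_time", "end_sec") if c.get(k) is not None), None)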
        out.append({"text": text, "start": start, "end": end})
    return out

def build_dataframe_from_jsons(json_dir):
    files = sorted(glob.glob(os.path.join(json_dir, "*.json")))
    if not files:
        raise FileNotFoundError(f"No JSON files found in {json_dir}. Check path.")
    rows = []
    for jf in files:
        chunks = load_json_chunks(jf)
        if not chunks:
            # If load_json_chunks returned nothing, attempt simple fallback
            try:
                with open(jf, "r", encoding="utf-8") as f:
                    raw = f.read().strip()
                    if raw:
                        rows.append({"file": os.path.basename(jf), "text": raw, "start": None, "end": None})
            except Exception:
                pass
            continue
        for c in chunks:
            rows.append({
                "file": os.path.basename(jf),
                "text": c["text"],
                "start": c["start"],
                "end": c["end"]
            })
    df = pd.DataFrame(rows)
    if df.empty:
        raise RuntimeError("No transcript text extracted from JSONs.")
    df = df.reset_index(drop=True)
    df["id"] = df.index.astype(str)
    return df

# 1) If embeddings.joblib exists, load it and try to preserve embeddings
if os.path.exists(EMB_PATH):
    try:
        print("Loading existing", EMB_PATH)
        existing = joblib.load(EMB_PATH)
        if not isinstance(existing, pd.DataFrame):
            print("Warning: existing embeddings.joblib not a DataFrame — will overwrite.")
            existing = None
    except Exception as e:
        print("Failed to load existing embeddings.joblib:", e)
        existing = None
else:
    existing = None

# 2) Build dataframe from cleaned jsons
print("Building transcript DataFrame from JSONs in:", CLEAN_JSON_DIR)
df = build_dataframe_from_jsons(CLEAN_JSON_DIR)

# 3) Reuse existing embeddings where the same (file, text) pair was already embedded.
# Row ids are positional, so matching on id alone could attach a stale vector to changed text.
if existing is not None and {"file", "text", "embedding"}.issubset(existing.columns):
    known = existing[["file", "text", "embedding"]].drop_duplicates(subset=["file", "text"])
    df = df.merge(known, on=["file", "text"], how="left")
    df["id"] = df.index.astype(str)
    print("Merged existing embeddings by file+text where possible.")
else:
    # ensure embedding column exists
    df["embedding"] = None

# 4) Find which rows need embeddings
missing_mask = df["embedding"].isna() | df["embedding"].apply(lambda x: x is None)
to_compute_texts = df.loc[missing_mask, "text"].astype(str).tolist()
print(f"Total rows: {len(df)} | Missing embeddings: {len(to_compute_texts)}")

# 5) Compute embeddings if needed
if len(to_compute_texts) > 0:
    print("Loading SBERT model:", MODEL_NAME)
    model = SentenceTransformer(MODEL_NAME)
    # We'll compute embeddings in batches to avoid memory spikes
    n = len(to_compute_texts)
    batches = math.ceil(n / BATCH_SIZE)
    embeddings = []
    for i in tqdm(range(batches), desc="embedding batches"):
        start = i * BATCH_SIZE
        end = min((i+1) * BATCH_SIZE, n)
        batch_texts = to_compute_texts[start:end]
        emb_batch = model.encode(batch_texts, show_progress_bar=False, convert_to_numpy=True, batch_size=BATCH_SIZE)
        embeddings.append(emb_batch)
    import numpy as np
    embeddings = np.vstack(embeddings)
    # Put embeddings back into df
    idxs = df.loc[missing_mask].index.tolist()
    if len(idxs) != embeddings.shape[0]:
        raise RuntimeError("Mismatch between indices to fill and number of embeddings computed.")
    for i, idx in enumerate(idxs):
        df.at[idx, "embedding"] = embeddings[i].tolist()
    # Save to joblib
    joblib.dump(df, EMB_PATH)
    print("✅ Saved embeddings.joblib with", len(df), "rows.")
else:
    print("No missing embeddings. Using existing embeddings.joblib as-is.")

# 6) Quick sanity checks and expose df as embeddings_df to be used by downstream cells
embeddings_df = df  # name downstream code can use
print("embeddings_df ready. Sample columns:", embeddings_df.columns.tolist())
if embeddings_df.loc[embeddings_df["embedding"].notna()].shape[0] > 0:
    sample_emb = embeddings_df.loc[embeddings_df["embedding"].notna(), "embedding"].iloc[0]
    print("Sample embedding length:", len(sample_emb))
else:
    print("Warning: no embeddings present after run.")


Building transcript DataFrame from JSONs in: jsons/clean
Total rows: 2 | Missing embeddings: 2
Loading SBERT model: all-MiniLM-L6-v2


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



✅ Saved embeddings.joblib with 2 rows.
embeddings_df ready. Sample columns: ['file', 'text', 'start', 'end', 'id', 'embedding']
Sample embedding length: 384


In [None]:
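import os
from google.colab import userdata

# Read the key from Colab secrets instead of hard-coding it in the notebook.
# Assumption: the secret named 'RAG-API' (the name Cell 8 reads) holds the Groq API key.
os.environ["GROQ_API_KEY"] = userdata.get("RAG-API")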


🤖 **Cell 8 Description:**

This cell enables **Retrieval-Augmented Generation (RAG)** using either **Groq** or **Gemini** as the language model backend.
It loads the precomputed **embeddings** into a **FAISS** index for similarity search, retrieves the most relevant text chunks for each query, and passes them to the selected LLM to generate **context-aware answers**.
If neither API key is set, the cell raises an error telling you to configure one. Once a key is available, you can **interactively ask questions**, and the assistant is instructed to answer using only the provided course or lecture content, which suits an **AI-powered teaching assistant**.
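To make the retrieval step concrete, here is a brute-force sketch of what a flat L2 index computes: the squared Euclidean distance from the query to every stored vector, followed by a top-k selection. It uses plain Python lists for illustration only. `top_k_l2` and `format_context` are hypothetical names, and the cell below relies on FAISS for the actual search.

```python
def top_k_l2(query, vectors, k=4):
    """Return [(index, squared_distance), ...] for the k vectors closest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # squared Euclidean distance, the quantity a flat L2 index ranks by
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((i, dist))
    scored.sort(key=lambda pair: pair[1])
    # never return more hits than there are stored vectors
    return scored[:min(k, len(scored))]

def format_context(hits, texts):
    """Number the retrieved chunks in rank order, as the prompt context does."""
    return "\n\n".join(f"[{rank + 1}] {texts[i]}" for rank, (i, _) in enumerate(hits))
```

Clamping `k` to the number of stored vectors matters for small corpora, where asking for more neighbours than exist would otherwise yield placeholder results.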


In [1]:
# ===========================
# ✅ CELL 8 — RAG with Groq or Gemini (auto-pick model)
# ===========================
!pip -q install faiss-cpu sentence-transformers joblib groq google-generativeai

import os, numpy as np, joblib, faiss, sys, traceback
from sentence_transformers import SentenceTransformer

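from google.colab import userdata

# Fall back to the Colab secret when no key is set in the environment.
# Assumption: the 'RAG-API' secret holds a Groq API key.
if not (os.getenv("GROQ_API_KEY") or os.getenv("GEMINI_API_KEY")):
    os.environ["GROQ_API_KEY"] = userdata.get("RAG-API")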

# -----------------------------
# 1) Load embeddings + index
# -----------------------------
embeddings_df = joblib.load("embeddings.joblib")
emb_matrix = np.vstack(embeddings_df["embedding"].values).astype("float32")
texts = embeddings_df["text"].tolist()

index = faiss.IndexFlatL2(emb_matrix.shape[1])
index.add(emb_matrix)
print(f"✅ Loaded {len(texts)} chunks into FAISS.")

embedder = SentenceTransformer("all-MiniLM-L6-v2")

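def retrieve_context(query, top_k=4):
    q = embedder.encode([query], convert_to_numpy=True).astype("float32")
    # Never ask for more neighbours than there are chunks; FAISS pads missing hits with -1,
    # which would otherwise index the last chunk by accident.
    k = min(top_k, len(texts))
    _, idx = index.search(q, k)
    hits = [i for i in idx[0] if i >= 0]
    return "\n\n".join([f"[{r+1}] {texts[i]}" for r, i in enumerate(hits)])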

# -----------------------------
# 2) Pick provider (GROQ/GEMINI)
# -----------------------------
PROVIDER = "GROQ" if os.getenv("GROQ_API_KEY") else ("GEMINI" if os.getenv("GEMINI_API_KEY") else None)
if not PROVIDER:
    raise RuntimeError(
        "No provider configured. Set one of:\n"
        "  os.environ['GROQ_API_KEY']   = '...'\n"
        "  os.environ['GEMINI_API_KEY'] = '...'\n"
        "Then re-run this cell."
    )

SYSTEM_PROMPT = (
    "You are a precise teaching assistant. Use ONLY the provided context. "
    "If the answer isn't in the context, say you don't know briefly."
)

# -----------------------------
# 3) Provider-specific setup
# -----------------------------
if PROVIDER == "GROQ":
    from groq import Groq
    groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

    # List models available to this key and auto-pick a good one
    try:
        available_models = sorted([m.id for m in groq_client.models.list().data])
    except Exception:
        print("⚠️ Could not list Groq models. Trying known names.")
        available_models = []

    print("🔎 Groq models available:\n", "\n ".join(available_models) or "(listing failed)")

    PREFERRED = [
        # common currently-supported ids (adjusts over time)
        "llama-3.1-70b-specdec",
        "llama-3.1-8b-instant",
        "mixtral-8x7b-32768",
        "gemma2-9b-it",
        # fallbacks some accounts expose
        "llama3-70b-8192",
        "llama3-8b-8192",
    ]
    MODEL = next((m for m in PREFERRED if (not available_models or m in available_models)), None)
    if not MODEL:
        raise RuntimeError(
            "No preferred Groq models found for your key. Enable one in Groq console, "
            "or replace PREFERRED with a model id you do have."
        )

    print(f"✅ Using Groq model: {MODEL}")

    def generate_answer(query, context):
        prompt = f"{SYSTEM_PROMPT}\n\nCONTEXT:\n{context}\n\nQUESTION:\n{query}\n\nAnswer:"
        try:
            resp = groq_client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.2,
                max_tokens=512,
            )
            return resp.choices[0].message.content.strip()
        except Exception as e:
            return f"[Generation error: {e}]"

elif PROVIDER == "GEMINI":
    import google.generativeai as genai
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    def generate_answer(query, context):
        prompt = f"{SYSTEM_PROMPT}\n\nCONTEXT:\n{context}\n\nQUESTION:\n{query}\n\nAnswer:"
        try:
            resp = model.generate_content(prompt)
            return (resp.text or "").strip()
        except Exception as e:
            return f"[Generation error: {e}]"

# -----------------------------
# 4) Interactive QA
# -----------------------------
def ask_question():
    print(f"🔌 Provider: {PROVIDER}\nType 'exit' to quit.")
    while True:
        q = input("\n❓ Your question: ").strip()
        if q.lower() in ("exit","quit"):
            print("👋 Bye!"); break
        ctx = retrieve_context(q, top_k=4)
        print("\n🔎 Retrieved Context:\n", (ctx[:800] + ("..." if len(ctx) > 800 else "")))
        ans = generate_answer(q, ctx)
        print("\n💬 Answer:\n", ans)

ask_question()


KeyboardInterrupt: 

# **🧩 Project Summary**

**Project Title: 🧠 RAG-Based Teaching AI Assistant (Offline Vosk + Embeddings + RAG Integration)**

This project builds a Retrieval-Augmented Generation (RAG) pipeline that converts video lectures into structured, searchable text and enables an AI-powered question-answering system — all while maintaining offline speech recognition capabilities using Vosk.

The system automatically:

1. Imports your teaching project folder directly from Google Drive.
2. Converts all uploaded video lectures (.mp4) into audio files (.mp3) using ffmpeg.
3. Transcribes those audio files offline using the Vosk speech recognition model.
4. Cleans and structures the resulting text data into lightweight JSON files.
5. Creates semantic embeddings using SentenceTransformers for context retrieval.
6. Builds a FAISS vector index for efficient similarity search.
7. Integrates with Groq or Gemini LLMs for natural, context-aware question answering.

*Ultimately, this notebook allows educators, students, and developers to query lecture content interactively, creating a personalized AI teaching assistant grounded in their own material.*