# E2E Example: YT Topics Pro

This notebook demonstrates a minimal, end-to-end run of the `yt-topics-pro` pipeline.

1.  **Ingest**: Fetch data for a few public videos.
2.  **Process**: Chunk, embed, model topics, and run sentiment analysis.
3.  **Evaluate**: Compute topic quality metrics.
4.  **Dashboard**: Show how to launch the interactive dashboard.
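
Each CLI stage in this notebook is invoked through `subprocess.run` and its captured output is printed. As a minimal sketch of that shared pattern, a helper along these lines could wrap it. The name `run_cli` and its `label` parameter are hypothetical and not part of `yt-topics-pro`; the demo call uses the Python interpreter itself so that it runs anywhere.

```python
import subprocess
import sys


def run_cli(command, label):
    """Run a command, print its captured output, and return the exit code.

    Hypothetical convenience helper for this notebook; it is not part of
    the yt-topics-pro package.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    print(f"--- {label} Output ---")
    print(result.stdout)
    if result.stderr:
        # Many CLIs also write progress logs to stderr, so output here
        # does not necessarily indicate a failure.
        print(f"--- {label} Stderr ---")
        print(result.stderr)
    return result.returncode


# Demonstrated with the Python interpreter as a stand-in command.
exit_code = run_cli([sys.executable, "-c", "print('hello')"], "Demo")
```

Returning the exit code lets a cell fail early with a clear message instead of relying only on the file-existence checks that follow each stage.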


In [None]:
import os
import subprocess
from pathlib import Path

# Make sure we're running from the project root
# (assumes this notebook lives one directory below it)
if Path.cwd().name != "yt-topics-pro":
    os.chdir("..")

print(f"Current working directory: {Path.cwd()}")

# Define paths
VIDEOS_FILE = "videos.txt"
DATA_DIR = "data"

## 1. Ingest Data

First, we'll create a `videos.txt` file containing one YouTube watch URL per line, built from a few video IDs. Then, we'll run the `ingest` command from our CLI.
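
Before ingesting, it can help to sanity-check the entries in `videos.txt`. The sketch below extracts the `v` query parameter from a watch URL using only the standard library. The helper name `extract_video_id` is hypothetical, and the 11-character ID pattern is an assumption that matches current YouTube IDs rather than a documented guarantee.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters drawn from letters, digits,
# "-" and "_". This matches current IDs but is not a documented guarantee.
VIDEO_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(url):
    """Return the `v` query parameter of a watch URL, or None if missing or malformed."""
    query = parse_qs(urlparse(url).query)
    candidates = query.get("v", [])
    if candidates and VIDEO_ID_PATTERN.fullmatch(candidates[0]):
        return candidates[0]
    return None


print(extract_video_id("https://www.youtube.com/watch?v=vC_T_3T-P_s"))
print(extract_video_id("https://www.youtube.com/watch?list=abc"))
```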

In [None]:
# A few interesting, shorter videos from different domains
video_ids = [
    "vC_T_3T-P_s", # Tom Scott: The GPS constellation
    "6_pnl_6o_jA", # Marques Brownlee: The Problem with Foldable Phones
    "iLcvvA2fStA", # Kurzgesagt: What If You Detonated a Nuclear Bomb In The Mariana Trench?
    "X4d_w_2G5d4", # SmarterEveryDay: A Baffling Balloon Behavior
    "kPRA0W1kECg", # 3Blue1Brown: The essence of calculus
]

with open(VIDEOS_FILE, "w") as f:
    for vid in video_ids:
        f.write(f"https://www.youtube.com/watch?v={vid}\n")

print(f"Created '{VIDEOS_FILE}' with {len(video_ids)} videos.")

In [None]:
# Run the ingest command
# Using --limit for a quick test run
ingest_command = [
    "yt-topics-pro", "ingest",
    "--videos", VIDEOS_FILE,
    "--limit", "5"
]

# We use subprocess.run to execute our CLI commands from within the notebook
result = subprocess.run(ingest_command, capture_output=True, text=True)

print("--- Ingest Output ---")
print(result.stdout)
if result.stderr:
    print("--- Ingest Errors ---")
    print(result.stderr)

# Check that the output files were created
parquet_raw_dir = Path(DATA_DIR) / "parquet" / "raw"
assert (parquet_raw_dir / "transcripts.parquet").exists(), "transcripts.parquet was not created"
assert (parquet_raw_dir / "metadata.parquet").exists(), "metadata.parquet was not created"
print(f"\n✅ Raw data saved in: {parquet_raw_dir}")

## 2. Process Data

Now we run the main processing pipeline. This will:
- Chunk the transcripts
- Generate embeddings
- Run BERTopic
- Calculate sentiment

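The `--gpu` flag enables GPU acceleration and falls back to CPU gracefully when no GPU is available; it is commented out in the cell below, so uncomment it if you have a compatible GPU with CUDA installed. The `--sample` flag restricts the run to a small subset of chunks to keep this example fast.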
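
Instead of toggling `--gpu` by hand, the command could be assembled based on a runtime check. This is a minimal sketch that assumes the pipeline's GPU path relies on PyTorch with CUDA, which this notebook does not confirm. The helper names `cuda_available` and `build_process_command` are hypothetical; only the `--sample` and `--gpu` flags come from the CLI usage shown here.

```python
import importlib.util


def cuda_available():
    """Best-effort check; returns False when PyTorch is missing or unusable."""
    if importlib.util.find_spec("torch") is None:
        return False
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def build_process_command(sample_size, use_gpu):
    """Assemble the `process` invocation, adding --gpu only on request."""
    command = ["yt-topics-pro", "process", "--sample", str(sample_size)]
    if use_gpu:
        command.append("--gpu")
    return command


print(build_process_command(100, use_gpu=cuda_available()))
```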

In [None]:
# Run the process command
process_command = [
    "yt-topics-pro", "process",
    "--sample", "100", # Use a small sample for a quick demo
    # "--gpu" # Uncomment if you have a compatible GPU and CUDA installed
]

result = subprocess.run(process_command, capture_output=True, text=True)

print("--- Process Output ---")
print(result.stdout)
if result.stderr:
    print("--- Process Errors ---")
    print(result.stderr)

# Check that the output files were created
parquet_processed_dir = Path(DATA_DIR) / "parquet" / "processed"
assert (parquet_processed_dir / "chunks.parquet").exists(), "chunks.parquet was not created"
print(f"\n✅ Processed data saved in: {parquet_processed_dir}")

## 3. Evaluate Topics

Next, we'll run the `evaluate` command, which reports coherence and diversity scores as measures of topic quality.

**Note:** The current `evaluate` command is a stub and uses sample data. A full implementation would load the model and data from the `process` step.
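
To make the diversity score concrete, here is a sketch of one common definition: the fraction of unique words among the top words of all topics. It is not necessarily what `evaluate` computes; the function name `topic_diversity` and the example topics are made up for illustration.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    A value of 1.0 means no word is shared between topics; lower values
    mean the topics overlap more.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Illustrative topics only; "orbit" appears in two of them.
example_topics = [
    ["gps", "satellite", "orbit"],
    ["phone", "screen", "fold"],
    ["orbit", "bomb", "ocean"],
]
print(topic_diversity(example_topics, top_k=3))
```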

In [None]:
# Run the evaluate command
evaluate_command = ["yt-topics-pro", "evaluate"]
result = subprocess.run(evaluate_command, capture_output=True, text=True)

print("--- Evaluate Output ---")
print(result.stdout)
if result.stderr:
    print("--- Evaluate Errors ---")
    print(result.stderr)

## 4. Launch the Dashboard

Finally, the `dashboard` command launches the interactive Streamlit application.

Launching the dashboard from a notebook cell would start a web server and block that cell until the kernel is interrupted, so the cell below only prints the command. In a real workflow, you would run this command directly from your terminal.
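
If you do want to start the server from the notebook, one option is to launch it as a background process and stop it explicitly. The sketch below uses `subprocess.Popen` for that; the helper names `start_background` and `stop_background` are hypothetical, and a short-lived Python process stands in for the Streamlit command so the example runs without the dashboard installed.

```python
import subprocess
import sys
import time


def start_background(command):
    """Start a long-running command without blocking the notebook."""
    return subprocess.Popen(command)


def stop_background(process, timeout=10):
    """Ask the process to exit, and kill it if it does not comply in time."""
    process.terminate()
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        process.wait()
    return process.returncode


# Stand-in for the dashboard command so the sketch runs anywhere; in this
# notebook it would be ["streamlit", "run", "src/yt_topics_pro/app/dashboard.py"].
demo = start_background([sys.executable, "-c", "import time; time.sleep(60)"])
time.sleep(0.5)
still_running = demo.poll() is None
exit_status = stop_background(demo)
```

Keeping the returned process handle around is what makes it possible to stop the server without restarting the kernel.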

In [None]:
# The command to run is:
# streamlit run src/yt_topics_pro/app/dashboard.py

print("To launch the dashboard, open your terminal and run the following command:")
print("\n" + "="*50)
print("streamlit run src/yt_topics_pro/app/dashboard.py")
print("="*50 + "\n")

# We won't run it here as it blocks the notebook.