This project builds a movie-opinion intelligence pipeline on top of Actian VectorAI DB.
It supports:
- ingesting text evidence into Actian Vector DB
- semantic search over that evidence
- Gemini-powered RAG responses
- rating/market prediction from structured question/answer input
- `scrape_data_reddit.py`: Main ingestion/search utility. Supports:
  - `ingest` (Reddit + Quora scrape path; may be blocked depending on environment)
  - `ingest-file --path ...` (JSONL file ingest; the recommended, reliable path)
  - `search --query ...` (semantic retrieval from Actian)
- `ingest_market_signals.py`: Market-signal ingester.
  - OMDb market/ratings data (IMDb, Rotten Tomatoes, Metacritic, BoxOffice fields)
  - YouTube comments (via the YouTube Data API)
  - Writes all data to Actian through the shared ingestion pipeline.
- `ingest_mcu_market_pack.py`: Bulk ingest for MCU and comparable titles.
  - Uses OMDb plus optional YouTube comments.
  - Designed for larger baseline coverage.
- `app.py`: Basic Gemini script that generates controversial "what-if" scenarios from a title alone.
- `rag_app.py`: Interactive Gemini app that retrieves relevant context from Actian before scenario generation.
- `gemini_predict_from_vectordb.py`: Gemini-based prediction engine.
  - Retrieves related evidence from Actian and returns strict prediction JSON including:
    - IMDb change
    - Rotten Tomatoes change
    - fan sentiment metrics
    - box office change (% and predicted USD)
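A downstream consumer can sanity-check the strict prediction JSON before trusting it. The sketch below is illustrative, not part of the repo; field names follow the sample output shown later in this README, and the tolerance is an assumption:

```python
# Check that each score's delta equals predicted - current, and that
# the box-office deltaPercent is consistent with the USD figures.
def check_prediction(pred: dict) -> list[str]:
    problems = []
    for key in ("imdb", "rt"):
        score = pred["predictions"][key]
        if score["predicted"] - score["current"] != score["delta"]:
            problems.append(f"{key}: delta does not match predicted - current")
    box = pred["predictions"]["boxOffice"]
    implied = box["currentUsd"] * (1 + box["deltaPercent"] / 100)
    if abs(implied - box["predictedUsd"]) > 0.01 * box["currentUsd"]:
        problems.append("boxOffice: deltaPercent inconsistent with USD figures")
    return problems

sample = {
    "predictions": {
        "imdb": {"current": 84, "predicted": 86, "delta": 2},
        "rt": {"current": 90, "predicted": 92, "delta": 2},
        "boxOffice": {"currentUsd": 858373000,
                      "predictedUsd": 901291650,
                      "deltaPercent": 5.0},
    }
}
print(check_prediction(sample))  # → []
```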
- Collect data
  - Scrape/API-load raw text + market metadata.
- Normalize to documents
  - Unified schema (`RawDocument`) with text + metadata.
- Chunk
  - Text split into retrieval chunks.
- Embed
  - Local hasher embedder (no embedding API required).
- Store in Actian
  - `CortexClient` collection + payload per chunk.
- Retrieve + Predict
  - Semantic search for evidence.
  - Rule-based or Gemini-based forecasting.
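The normalize/chunk/embed steps above can be sketched end-to-end. `RawDocument` is named in the pipeline, but everything else here (the word-window chunker, MD5 bucket hashing, the defaults) is a simplified stand-in for the repo's shared pipeline, not its actual code:

```python
import hashlib
import math
from dataclasses import dataclass, field

@dataclass
class RawDocument:
    # Unified schema: free text plus arbitrary metadata.
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_words(text: str, size: int = 120, overlap: int = 20) -> list[str]:
    # Overlapping word windows, mirroring CHUNK_SIZE_WORDS / CHUNK_OVERLAP_WORDS.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def hash_embed(chunk: str, dim: int = 1024) -> list[float]:
    # Local hashing embedder: no embedding API; each token hashes into a bucket.
    vec = [0.0] * dim
    for token in chunk.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

doc = RawDocument(text="Endgame stuck the landing " * 60,
                  metadata={"source": "dataset"})
chunks = chunk_words(doc.text)
vectors = [hash_embed(c) for c in chunks]
print(len(chunks), len(vectors[0]))  # → 3 1024
```

Each (chunk, vector, metadata) triple would then be written to the Actian collection as a payload per chunk.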
- Python 3.10+ (3.12 recommended)
- A running Actian server (commonly in Codespaces/Linux x86_64 for beta stability)
- Endpoint typically `localhost:50051`
- API keys:
  - `GEMINI_API_KEY` (for Gemini scripts)
  - `OMDB_API_KEY` (for OMDb market data)
  - `YOUTUBE_API_KEY` (for YouTube comments)
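To fail fast when a required key is missing, a small guard like the following can be run at the top of a script. The variable names match the list above; the helper itself is illustrative, not part of the repo:

```python
import os

def require_env(*names: str) -> dict:
    # Collect required environment variables, reporting all gaps at once.
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise SystemExit(f"Missing required env vars: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}

# e.g. before running a Gemini script:
# keys = require_env("GEMINI_API_KEY", "CORTEX_ADDRESS")
```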
From repo root:
python3 -m venv .venv
source .venv/bin/activate
pip install requests beautifulsoup4 praw python-dotenv google-generativeai
pip install ./actian-vectorAI-db-beta/actiancortex-0.1.0b1-py3-none-any.whl

If the wheel path differs, point to your actual actiancortex-0.1.0b1-py3-none-any.whl.
Set these before running scripts:
export CORTEX_ADDRESS=localhost:50051
export ACTIAN_COLLECTION_NAME=endgame_opinions
export ACTIAN_RECREATE_COLLECTION=false
export GEMINI_API_KEY="YOUR_GEMINI_KEY"
export GOOGLE_API_KEY="$GEMINI_API_KEY"
export OMDB_API_KEY="YOUR_OMDB_KEY"
export YOUTUBE_API_KEY="YOUR_YOUTUBE_KEY"
export LOCAL_EMBED_DIM=1024
export CHUNK_SIZE_WORDS=120
export CHUNK_OVERLAP_WORDS=20

Notes:
- Keep `ACTIAN_RECREATE_COLLECTION=false` for normal runs.
- Set `ACTIAN_RECREATE_COLLECTION=true` only when intentionally resetting the collection.
From the bundled beta repo:
cd actian-vectorAI-db-beta
docker compose up -d
docker compose ps

If you already have a JSONL dataset:
python3 scrape_data_reddit.py ingest-file --path data/movie_opinions.jsonl

JSONL format:
- required: `text`
- optional: `id`, `source`, `url`, `title`, `author`, `created_at`, `score`, `metadata`
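A compatible JSONL file is one JSON object per line. A minimal sketch for producing one (the example texts and the `data/` path are illustrative; only `text` is required per the schema above):

```python
import json
import os

# Two example opinion docs; only "text" is required, the other fields
# (source, title, score, ...) are optional per the schema above.
opinions = [
    {"text": "The time-heist structure made the payoff land harder.",
     "source": "dataset", "title": "Avengers: Endgame", "score": 42},
    {"text": "Pacing sagged in the middle act."},
]

os.makedirs("data", exist_ok=True)
with open("data/movie_opinions.jsonl", "w", encoding="utf-8") as fh:
    for doc in opinions:
        fh.write(json.dumps(doc, ensure_ascii=False) + "\n")
```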
OMDb + YouTube for one movie:
python3 ingest_market_signals.py --movie-title "Avengers: Endgame"

OMDb only:
python3 ingest_market_signals.py --movie-title "Avengers: Endgame" --skip-youtube

YouTube only:
python3 ingest_market_signals.py --movie-title "Avengers: Endgame" --skip-omdb

Safe starter:
python3 ingest_mcu_market_pack.py \
--max-titles 4 \
--youtube-max-videos 1 \
  --youtube-comments-per-video 20

Scale up:
python3 ingest_mcu_market_pack.py \
--max-titles 8 \
--youtube-max-videos 1 \
  --youtube-comments-per-video 30

OMDb-only bulk:
python3 ingest_mcu_market_pack.py --max-titles 12 --skip-youtube

Semantic search over the ingested evidence:
python3 scrape_data_reddit.py search --query "Avengers Endgame IMDb Rotten Tomatoes box office" --top-k 10
python3 scrape_data_reddit.py search --query "Infinity War audience reaction ending" --top-k 10

Look for source values like:
- `omdb_market`
- `youtube_comment`
- `dataset` (if file-ingested)
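Those source values make it easy to separate market evidence from fan chatter in search results. The hit dicts below are a hypothetical shape, not the actual return format of `scrape_data_reddit.py search`; adapt the field names to what the script really emits:

```python
from collections import defaultdict

# Hypothetical search hits; "source" mirrors the payload field above.
hits = [
    {"source": "omdb_market", "text": "IMDb 8.4, RT 94%, BoxOffice $858M"},
    {"source": "youtube_comment", "text": "That portals scene still gives chills"},
    {"source": "dataset", "text": "Rewatched it twice, the ending holds up"},
]

by_source = defaultdict(list)
for hit in hits:
    by_source[hit["source"]].append(hit["text"])

print(sorted(by_source))  # → ['dataset', 'omdb_market', 'youtube_comment']
```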
python3 app.py
python3 rag_app.py
python3 gemini_predict_from_vectordb.py --input input.json --output prediction.json

With more retrieval per question:
python3 gemini_predict_from_vectordb.py --input input.json --output prediction.json --top-k-per-query 10

Example input (input.json):
{
"questions": [
{"id": "q1", "text": "Was the pacing of the climax well-executed?"},
{"id": "q2", "text": "Did multiple viewings reveal new details and depth in the ending?"}
],
"answers": {
"q1": "yes",
"q2": "no"
}
}

Example output (prediction.json):
{
"id": "avengers-endgame",
"title": "Avengers: Endgame",
"year": 2019,
"predictions": {
"imdb": {"current": 84, "predicted": 86, "delta": 2},
"rt": {"current": 90, "predicted": 92, "delta": 2},
"fanRating": {"positivePercent": 61.2, "negativePercent": 38.8, "netSentiment": 22.4},
"boxOffice": {"currentUsd": 858373000, "predictedUsd": 901291650, "deltaPercent": 5.0}
},
"assumptions": ["...", "..."]
}

Troubleshooting:
- Reddit/Quora source blocked (common in cloud dev environments).
  - Use `ingest-file` or API-backed sources (`ingest_market_signals.py`).
- Reduce ingestion volume per run.
  - Use smaller settings:
    - fewer titles/videos/comments
    - smaller `CHUNK_SIZE_WORDS`
- Export both `GEMINI_API_KEY` and `GOOGLE_API_KEY=$GEMINI_API_KEY`.
- Ensure you are in the repo root: `/workspaces/Hacklytics` (Codespaces).
- Never commit `.env`.
- Add `.env` to `.gitignore`.
- Rotate keys if exposed in logs/chat.
Suggested demo flow:
- Start the Actian DB.
- Ingest OMDb + YouTube for Endgame: `ingest_market_signals.py`
- Run semantic search to show grounded evidence.
- Run `gemini_predict_from_vectordb.py` on the Q/A input.
- Show the JSON prediction output and explain evidence traceability.