# Guided Workflow: From Data to Podcast

This notebook walks through the app's core concepts step by step: configuration, logging, a small sample dataset that stands in for scraped data, embeddings, similarity search, LLM script generation (dry run), and audio generation (dry run).

What you'll learn:
- How configuration and logging work
- How data flows from scraped/publication JSON to embeddings
- How to run similarity search and select relevant papers
- How a podcast script would be generated (without calling external APIs)
- How audio would be generated (without calling external APIs)

In [1]:
# Setup: paths and imports
import sys
from pathlib import Path

notebook_dir = Path().resolve()
src_dir = notebook_dir.parent / 'src'
if str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))
    print('Added src to path:', src_dir)
print('Notebook dir:', notebook_dir)
print('Src dir:', src_dir, 'exists:', src_dir.exists())

Added src to path: /home/santi/Projects/UBMI-IFC-Podcast/src
Notebook dir: /home/santi/Projects/UBMI-IFC-Podcast/notebooks
Src dir: /home/santi/Projects/UBMI-IFC-Podcast/src exists: True


In [2]:
# Config and logging
from utils.config import load_config, get_data_dir, get_output_dir
from utils.logger import setup_logger, get_logger

setup_logger(level='INFO')
logger = get_logger('guided')
config = load_config()

print('Base URL (IFC):', config['ifc']['base_url'])
print('Embeddings model:', config['embeddings']['model_name'])
print('LLM provider:', config['llm']['provider'])
print('Audio provider:', config['audio']['provider'])
print('Data dir:', get_data_dir())
print('Output dir:', get_output_dir())

Base URL (IFC): https://www.ifc.unam.mx
Embeddings model: sentence-transformers/all-MiniLM-L6-v2
LLM provider: openai
Audio provider: elevenlabs
Data dir: /home/santi/Projects/UBMI-IFC-Podcast/data
Output dir: /home/santi/Projects/UBMI-IFC-Podcast/outputs
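
The printed values suggest a nested dictionary keyed by section. As a minimal sketch, assuming only the four keys printed above exist (the real configuration file may contain more), a small check like the following can fail fast on a missing setting before any model is loaded. `example_config`, `REQUIRED_KEYS`, and `missing_keys` are hypothetical names used for illustration; they are not part of `utils.config`.

```python
# Hypothetical config shape, reconstructed only from the keys printed above.
example_config = {
    "ifc": {"base_url": "https://www.ifc.unam.mx"},
    "embeddings": {"model_name": "sentence-transformers/all-MiniLM-L6-v2"},
    "llm": {"provider": "openai"},
    "audio": {"provider": "elevenlabs"},
}

# (section, key) pairs this notebook reads; an assumption, not an official schema.
REQUIRED_KEYS = [
    ("ifc", "base_url"),
    ("embeddings", "model_name"),
    ("llm", "provider"),
    ("audio", "provider"),
]


def missing_keys(cfg, required=REQUIRED_KEYS):
    """Return the dotted paths of required keys that are absent from cfg."""
    missing = []
    for section, key in required:
        if key not in cfg.get(section, {}):
            missing.append(f"{section}.{key}")
    return missing


print("Missing keys:", missing_keys(example_config))
```

Returning the list of missing paths, rather than raising on the first one, lets a caller report every gap at once.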


## Prepare a small working dataset
We'll create a small, in-memory dataset representative of scraped publications so you can run embeddings and vector search without hitting the web.

In [3]:
# Sample publications (title + abstract)
sample_articles = [
    {
        'title': 'Deregulation of interferon-gamma receptor 1 expression and its implications for lung adenocarcinoma progression',
        'abstract': 'Interferon-gamma (IFN-γ) has dual roles in cancer... We explore receptor dysregulation and downstream signaling.',
        'year': 2024,
        'doi': '10.5306/wjco.v15.i2.195'
    },
    {
        'title': 'Altered Expression of Thyroid- and Oxidative-Stress-Related Genes in Cardiomyocytes',
        'abstract': 'We analyze cardiomyocyte gene expression under metabolic stress and observe significant changes in oxidative pathways.',
        'year': 2024,
        'doi': '10.1093/cvr/cvae156'
    },
    {
        'title': 'Regulation of Cellular Polarity in Epithelial Tissues',
        'abstract': 'Epithelial polarity is critical for tissue homeostasis; we discuss signaling interactions and experimental findings.',
        'year': 2024,
        'doi': '10.1128/jb.00264-24'
    }
]
print(f'Prepared {len(sample_articles)} sample articles')

Prepared 3 sample articles
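
In the real pipeline these records would come from publication JSON rather than a Python literal. The sketch below shows one way that step could look, assuming the JSON is a list of objects with the same fields as `sample_articles`. The function `articles_to_texts` and the second record (which has no abstract) are hypothetical and exist only to show how incomplete entries might be handled.

```python
import json

# Illustrative JSON payload: the first record mirrors a sample article above,
# the second is a made-up record used to exercise the missing-abstract path.
raw_json = """[
  {"title": "Regulation of Cellular Polarity in Epithelial Tissues",
   "abstract": "Epithelial polarity is critical for tissue homeostasis; we discuss signaling interactions and experimental findings.",
   "year": 2024, "doi": "10.1128/jb.00264-24"},
  {"title": "A record with no abstract", "year": 2024}
]"""


def articles_to_texts(articles):
    """Build one 'title. abstract' string per article, skipping untitled records."""
    texts = []
    for article in articles:
        title = article.get("title", "").strip()
        abstract = article.get("abstract", "").strip()
        if not title:
            continue  # nothing meaningful to embed
        texts.append(f"{title}. {abstract}" if abstract else title)
    return texts


texts_from_json = articles_to_texts(json.loads(raw_json))
print(f"Prepared {len(texts_from_json)} texts for embedding")
```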


## Embeddings: generate vectors and run similarity search
We'll use the configured sentence-transformer model to embed the sample texts and run cosine similarity to find nearest items.

In [4]:
from embeddings.manager import EmbeddingsManager
import numpy as np

em = EmbeddingsManager(config)
# Build combined texts (title + abstract) as the manager does
texts = [f"{a['title']}. {a.get('abstract','')}" for a in sample_articles]
embeddings = em.generate_embeddings(texts)
print('Embeddings shape:', embeddings.shape)

# Similarity against a query
query = 'immune signaling in cancer and interferon pathways'
query_vec = em.generate_embeddings([query])[0]
sims = (embeddings @ query_vec) / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec) + 1e-8)
ranking = np.argsort(-sims)
print('Query:', query)
for idx in ranking:
    print(f"  sim={sims[idx]:.3f} | {sample_articles[idx]['title'][:80]}")

2025-09-11 14:42:21 | INFO | embeddings.manager:load_model:38 - Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
2025-09-11 14:42:23 | INFO | embeddings.manager:generate_embeddings:53 - Generating embeddings for 3 texts
2025-09-11 14:42:24 | INFO | embeddings.manager:generate_embeddings:53 - Generating embeddings for 1 texts

Embeddings shape: (3, 384)

Query: immune signaling in cancer and interferon pathways
  sim=0.715 | Deregulation of interferon-gamma receptor 1 expression and its implications for 
  sim=0.271 | Regulation of Cellular Polarity in Epithelial Tissues
  sim=0.162 | Altered Expression of Thyroid- and Oxidative-Stress-Related Genes in Cardiomyocy
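
The cell above divides by both norms on every query. An equivalent approach, sketched here with toy 3-dimensional vectors instead of real model output, is to L2-normalize the document vectors once so that cosine similarity reduces to a plain dot product. The helper `l2_normalize` and the toy data are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / (norms + eps)


# Toy stand-ins for document and query embeddings.
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 0.0],
])
toy_query = np.array([2.0, 0.0, 0.0])

docs_unit = l2_normalize(toy_docs)    # done once, when documents are indexed
query_unit = l2_normalize(toy_query)  # done per query

toy_sims = docs_unit @ query_unit     # cosine similarity as a dot product
toy_ranking = np.argsort(-toy_sims)   # most similar first
for idx in toy_ranking:
    print(f"doc {idx}: sim={toy_sims[idx]:.3f}")
```

Normalizing at indexing time avoids recomputing the document norms for every query, which matters once the collection grows beyond a handful of items.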


## Vector-store demonstration (optional in-memory)
We can simulate a tiny in-memory vector index that returns the k nearest neighbors (k-NN) of a query by cosine similarity.

In [5]:
# Simple in-memory vector search
def top_k(query_text, k=2):
    q = em.generate_embeddings([query_text])[0]
    sims = (embeddings @ q) / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q) + 1e-8)
    order = np.argsort(-sims)[:k]
    return [(float(sims[i]), sample_articles[i]) for i in order]

results = top_k('cardiac metabolism under stress', k=2)
for s, art in results:
    print(f"sim={s:.3f} | {art['title']}")

2025-09-11 14:43:14 | INFO | embeddings.manager:generate_embeddings:53 - Generating embeddings for 1 texts

sim=0.559 | Altered Expression of Thyroid- and Oxidative-Stress-Related Genes in Cardiomyocytes
sim=0.160 | Regulation of Cellular Polarity in Epithelial Tissues
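
The `top_k` function relies on the global `embeddings` and `sample_articles`. If the same idea needs to be reused, one option is to wrap vectors and payloads in a small class. The following is a sketch only: `TinyVectorIndex` is a hypothetical name, it uses brute-force search over toy vectors, and it is not how the project's vector store is implemented.

```python
import numpy as np


class TinyVectorIndex:
    """Brute-force cosine-similarity index kept entirely in memory."""

    def __init__(self, dim):
        self.dim = dim
        self._vectors = np.empty((0, dim), dtype=float)
        self._payloads = []

    def add(self, vector, payload):
        """Store a unit-normalized copy of the vector with its payload."""
        v = np.asarray(vector, dtype=float)
        if v.shape != (self.dim,):
            raise ValueError(f"expected shape ({self.dim},), got {v.shape}")
        norm = np.linalg.norm(v)
        if norm == 0:
            raise ValueError("cannot index a zero vector")
        self._vectors = np.vstack([self._vectors, v / norm])
        self._payloads.append(payload)

    def search(self, query, k=2):
        """Return up to k (similarity, payload) pairs, best match first."""
        if not self._payloads:
            return []
        q = np.asarray(query, dtype=float)
        q = q / (np.linalg.norm(q) + 1e-8)
        sims = self._vectors @ q
        order = np.argsort(-sims)[:k]
        return [(float(sims[i]), self._payloads[i]) for i in order]


# Toy usage with 3-dimensional vectors in place of real embeddings.
index = TinyVectorIndex(dim=3)
index.add([1.0, 0.0, 0.0], "cardiac")
index.add([0.0, 1.0, 0.0], "polarity")
index.add([1.0, 1.0, 0.0], "mixed")
hits = index.search([1.0, 0.1, 0.0], k=2)
for sim, payload in hits:
    print(f"sim={sim:.3f} | {payload}")
```

Brute force is adequate for a few thousand vectors; larger collections usually call for an approximate nearest-neighbor library.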


## LLM: prepare prompt and dry-run generation
We'll assemble the article summary that feeds the script-generation prompt, without making live API calls. This previews the text that would be sent to the configured LLM provider (OpenAI or Anthropic).

In [6]:
from llm.script_generator import PodcastScriptGenerator

# Build a pseudo selection of top articles (use top-2 from prior vector search)
selected = [r[1] for r in results]

# Create the generator (this will read provider from config, but we won't call the API)
gen = PodcastScriptGenerator(config)

# Preview the article-summary part of the prompt via the private helper
# (the score is a 0.0 placeholder here, not the similarity computed earlier)
prompt_preview = gen._prepare_articles_summary([{
    'title': a['title'],
    'abstract': a.get('abstract',''),
    'doi': a.get('doi',''),
    'score': 0.0
} for a in selected])
print('--- Prompt preview (truncated) ---')
print(prompt_preview[:800])

--- Prompt preview (truncated) ---

Article 1:
Title: Altered Expression of Thyroid- and Oxidative-Stress-Related Genes in Cardiomyocytes
Authors: N/A
Journal: N/A
Publication Date: N/A
Similarity Score: 0.000

Abstract: We analyze cardiomyocyte gene expression under metabolic stress and observe significant changes in oxidative pathways.

Key Findings: Key findings not extracted.


Article 2:
Title: Regulation of Cellular Polarity in Epithelial Tissues
Authors: N/A
Journal: N/A
Publication Date: N/A
Similarity Score: 0.000

Abstract: Epithelial polarity is critical for tissue homeostasis; we discuss signaling interactions and experimental findings.

Key Findings: Key findings not extracted.



## Audio: TTS preparation (dry-run)
We'll clean the script text and show the first lines of what would be sent to TTS.

In [7]:
from audio.generator import AudioGenerator

audio = AudioGenerator(config)
fake_script = """# Podcast Episode

Welcome to this week's research roundup. 
First, we discuss interferon signaling in lung cancer. 
Then, cardiomyocyte responses to metabolic stress.
"""

cleaned = audio._clean_script_for_tts(fake_script)
print('--- Cleaned script ---')
print('\n'.join(cleaned.splitlines()[:10]))

--- Cleaned script ---
Podcast Episode

Welcome to this week's research roundup.
First, we discuss interferon signaling in lung cancer.
Then, cardiomyocyte responses to metabolic stress.
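
The output shows that the heading marker and trailing spaces were removed. As a rough illustration of that kind of cleanup, and not a reproduction of `_clean_script_for_tts` (whose implementation is not shown in this notebook), a regex-based pass might look like the following. `strip_markdown_for_tts` is a hypothetical helper.

```python
import re


def strip_markdown_for_tts(script):
    """Remove common Markdown markers so a TTS engine reads only the words."""
    cleaned_lines = []
    for line in script.splitlines():
        # Drop leading heading markers such as "# " or "## ".
        line = re.sub(r"^\s{0,3}#{1,6}\s+", "", line)
        # Replace [text](url) links with just their text.
        line = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", line)
        # Remove emphasis and inline-code markers.
        line = re.sub(r"[*`]+", "", line)
        cleaned_lines.append(line.rstrip())
    return "\n".join(cleaned_lines).strip()


demo_script = "# Podcast Episode\n\nWelcome to **this** week's [roundup](https://example.org). \n"
print(strip_markdown_for_tts(demo_script))
```

Underscores are left untouched on purpose, so identifiers such as gene or file names are not mangled.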


## Putting it together (pseudo-pipeline)
This is how the pipeline composes: select articles -> summarize with LLM -> synthesize audio. We'll keep calls offline, but outline the flow.
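
The flow can be sketched with offline stand-ins for each stage. Everything below is hypothetical: `select_articles`, `draft_script`, `plan_audio`, and `run_dry_pipeline` are illustrative names, the script is built from a fixed template instead of an LLM, and the audio step only reports what it would do. The scores and titles come from the vector-search output earlier in the notebook.

```python
from dataclasses import dataclass


@dataclass
class DryRunResult:
    selected_titles: list
    script: str
    audio_plan: dict


def select_articles(scored_articles, k=2):
    """Keep the k highest-scoring articles from (score, article) pairs."""
    ranked = sorted(scored_articles, key=lambda pair: pair[0], reverse=True)
    return [article for _, article in ranked[:k]]


def draft_script(articles):
    """Stand-in for the LLM step: fills a fixed template offline."""
    lines = ["Welcome to this week's research roundup."]
    for n, article in enumerate(articles, start=1):
        lines.append(f"Story {n}: {article['title']}.")
    return "\n".join(lines)


def plan_audio(script, provider="elevenlabs"):
    """Stand-in for the TTS step: describes the request without sending it."""
    return {"provider": provider, "characters": len(script), "would_call_api": False}


def run_dry_pipeline(scored_articles, k=2):
    """Compose the stages: select -> draft script -> plan audio."""
    selected = select_articles(scored_articles, k=k)
    script = draft_script(selected)
    return DryRunResult(
        selected_titles=[a["title"] for a in selected],
        script=script,
        audio_plan=plan_audio(script),
    )


scored = [
    (0.160, {"title": "Regulation of Cellular Polarity in Epithelial Tissues"}),
    (0.559, {"title": "Altered Expression of Thyroid- and Oxidative-Stress-Related Genes in Cardiomyocytes"}),
]
result = run_dry_pipeline(scored, k=2)
print(result.script)
print(result.audio_plan)
```

Keeping each stage as a plain function makes it straightforward to swap a stand-in for the real generator once API keys are configured.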