# Testing Advanced AI-Powered Scientific Podcast Generation

This notebook tests advanced functionalities for scientific podcast generation using mock data and fallback implementations for paid APIs.

We can replace our current keyword-based `_identify_research_field` function with a machine learning classifier trained on article embeddings. The goal is more accurate and nuanced categorization of research articles.

## Features to Test:
1. **AI-Powered Scientific Field Classifier** - Automatic categorization using embeddings
2. **Structured Script Generation** - Pydantic-based consistent scientific narrative
3. **Multi-Modal RAG Context** - Context-aware generation with related research
4. **Complete Pipeline Integration** - All features working together
5. **Error Handling** - Robust fallback mechanisms

All tests use mock data so they can run without paid API access.

In [1]:
# Setup: paths and imports
import sys, os
from pathlib import Path
import asyncio
import json
from typing import Dict, List, Optional
from dataclasses import dataclass
import numpy as np
import pandas as pd

# Add project paths
notebook_dir = Path().resolve()
src_dir = notebook_dir.parent / 'src'
if str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))
    print('Added src to path:', src_dir)
    
print('Notebook dir:', notebook_dir)
print('Src dir:', src_dir, 'exists:', src_dir.exists())

# Create output directories
outputs_dir = notebook_dir.parent / 'outputs'
test_outputs_dir = outputs_dir / 'advanced_tests'
test_outputs_dir.mkdir(parents=True, exist_ok=True)

print(f'Test outputs directory: {test_outputs_dir}')

Added src to path: /home/santi/Projects/UBMI-IFC-Podcast/src
Notebook dir: /home/santi/Projects/UBMI-IFC-Podcast/notebooks
Src dir: /home/santi/Projects/UBMI-IFC-Podcast/src exists: True
Test outputs directory: /home/santi/Projects/UBMI-IFC-Podcast/outputs/advanced_tests


In [2]:
# Install required packages
import subprocess
import sys

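# Map each pip package name to the module name used to import it.
# (scikit-learn is imported as `sklearn`, so the names cannot be derived automatically.)
packages_to_install = {
    'pydantic': 'pydantic',
    'scikit-learn': 'sklearn',
    'numpy': 'numpy',
    'pandas': 'pandas',
}

for package, module_name in packages_to_install.items():
    try:
        __import__(module_name)
        print(f'✅ {package} already installed')
    except ImportError:
        print(f'📦 Installing {package}...')
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
        print(f'✅ {package} installed successfully')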

# Now import all required packages
from pydantic import BaseModel, Field, ConfigDict
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import json
from typing import List, Dict, Optional, Any
from dataclasses import dataclass
from datetime import datetime
import random

print('\n✅ All required packages imported successfully')

✅ pydantic already installed
📦 Installing scikit-learn...
✅ scikit-learn installed successfully
✅ numpy already installed
✅ pandas already installed

✅ All required packages imported successfully


## Section 1: Setup and Configuration

Create mock providers and configure test environment with fallback implementations for paid APIs.
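
The fallback idea can be sketched as a thin wrapper that tries a primary provider and switches to an offline one when the call fails. This is only an illustration: `FailingPaidProvider`, `TinyMockProvider`, and `FallbackEmbeddingProvider` are hypothetical names, not classes from this project.

```python
import asyncio
from typing import List


class FailingPaidProvider:
    """Hypothetical stand-in for a paid API client that is unavailable."""

    async def generate_embedding(self, text: str) -> List[float]:
        raise ConnectionError("paid API not configured")


class TinyMockProvider:
    """Hypothetical offline provider returning a small dummy vector."""

    async def generate_embedding(self, text: str) -> List[float]:
        return [float(len(text))] * 4


class FallbackEmbeddingProvider:
    """Try the primary provider first; use the fallback if it raises."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    async def generate_embedding(self, text: str) -> List[float]:
        try:
            return await self.primary.generate_embedding(text)
        except Exception:
            # Any failure of the primary provider routes the call to the fallback.
            return await self.fallback.generate_embedding(text)


provider = FallbackEmbeddingProvider(FailingPaidProvider(), TinyMockProvider())
# In a script, asyncio.run drives the coroutine; inside a notebook cell,
# `await provider.generate_embedding("brain")` would be used instead.
vector = asyncio.run(provider.generate_embedding("brain"))
```

Catching every `Exception` keeps the sketch short; a real wrapper would probably catch only the provider's own error types and log the failure.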

In [3]:
# Mock providers for testing without paid APIs
class MockEmbeddingProvider:
    """Mock embedding provider that generates realistic embeddings for testing"""
    
    def __init__(self):
        self.embedding_dim = 768  # Standard embedding dimension
        self.logger = self._setup_logger()
        
    def _setup_logger(self):
        import logging
        logging.basicConfig(level=logging.INFO)
        return logging.getLogger(__name__)
    
    async def generate_embedding(self, text: str, task_type: str = "CLASSIFICATION") -> List[float]:
        """Generate mock embedding based on text content"""
        # Use text hashing and feature extraction to create realistic embeddings
        words = text.lower().split()
        
        # Create seed based on text content for reproducible embeddings
        seed = sum(ord(c) for c in text) % 10000
        np.random.seed(seed)
        
        # Base embedding with random values
        embedding = np.random.normal(0, 0.1, self.embedding_dim)
        
        # Add semantic features based on keywords
        feature_weights = {
            'neuroscience': np.random.normal(0.8, 0.1, 50),
            'brain': np.random.normal(0.7, 0.1, 50),
            'neural': np.random.normal(0.7, 0.1, 50),
            'cancer': np.random.normal(0.9, 0.1, 50),
            'tumor': np.random.normal(0.8, 0.1, 50),
            'oncology': np.random.normal(0.8, 0.1, 50),
            'immune': np.random.normal(0.7, 0.1, 50),
            'antibody': np.random.normal(0.6, 0.1, 50),
            'vaccine': np.random.normal(0.6, 0.1, 50),
            'gene': np.random.normal(0.8, 0.1, 50),
            'dna': np.random.normal(0.7, 0.1, 50),
            'protein': np.random.normal(0.6, 0.1, 50),
        }
        
        # Apply semantic weights
        for word in words:
            if word in feature_weights:
                feature_vec = feature_weights[word][:self.embedding_dim]
                embedding[:len(feature_vec)] += feature_vec
        
        # Normalize
        embedding = embedding / np.linalg.norm(embedding)
        
        self.logger.info(f"Generated embedding for text: '{text[:50]}...' (dim: {len(embedding)})")
        return embedding.tolist()

class MockLLMProvider:
    """Mock LLM provider that generates realistic responses for testing"""
    
    def __init__(self):
        self.logger = self._setup_logger()
        
    def _setup_logger(self):
        import logging
        logging.basicConfig(level=logging.INFO)
        return logging.getLogger(__name__)
    
    async def generate_response(self, prompt: str, **kwargs) -> str:
        """Generate mock response based on prompt analysis"""
        prompt_lower = prompt.lower()
        
        # Detect request type and generate appropriate response
        if 'json' in kwargs.get('response_format', '').lower() or 'json' in prompt_lower:
            return self._generate_structured_response(prompt)
        elif 'podcast script' in prompt_lower:
            return self._generate_podcast_script(prompt)
        elif 'context' in prompt_lower and 'previous work' in prompt_lower:
            return self._generate_context_aware_script(prompt)
        else:
            return self._generate_general_response(prompt)
    
    def _generate_structured_response(self, prompt: str) -> str:
        """Generate structured JSON response for Pydantic validation"""
        mock_response = {
            "podcast_title": "Breakthrough in Neural Network Plasticity: How Brain Cells Adapt to New Information",
            "introduction": "Have you ever wondered how your brain learns and adapts throughout your life? Today we're diving into fascinating new research that reveals the incredible flexibility of neural networks in the adult brain.",
            "methods_summary": "Researchers used advanced brain imaging techniques combined with machine learning algorithms to track how individual neurons change their connections over time during learning tasks.",
            "key_findings": [
                "Adult neurons can form new connections 40% faster than previously thought",
                "Learning triggers specific protein cascades that strengthen synaptic bonds",
                "Brain plasticity varies significantly across different regions and age groups"
            ],
            "implications_and_significance": "These findings could revolutionize how we approach neurological rehabilitation and age-related cognitive decline. Understanding neural plasticity mechanisms opens new doors for treating stroke patients and developing brain-training programs.",
            "conclusion": "This research fundamentally changes our understanding of the adult brain's capacity for adaptation, proving that we're never too old to learn new tricks - literally at the cellular level."
        }
        return json.dumps(mock_response, indent=2)
    
    def _generate_podcast_script(self, prompt: str) -> str:
        """Generate mock podcast script"""
        return """Welcome to Science Decoded, the podcast where we break down the latest breakthroughs in research.

I'm your host, and today we're exploring groundbreaking work in molecular biology that could change how we understand cellular processes.

**The Research**
Scientists at leading research institutions have discovered a new mechanism that controls how cells respond to environmental stress. Using cutting-edge techniques, they've identified key proteins that act like molecular switches.

**Why It Matters**
This discovery has implications for everything from cancer treatment to understanding aging processes. The research opens new avenues for therapeutic intervention.

**Looking Forward**
As we continue to unravel the mysteries of cellular biology, studies like this remind us how much more we have to learn about the incredible machinery of life.

Thanks for joining us today on Science Decoded."""
    
    def _generate_context_aware_script(self, prompt: str) -> str:
        """Generate context-aware script that references previous work"""
        return """Welcome to Research Frontiers, where we explore how today's discoveries build on yesterday's breakthroughs.

Today's study represents a significant advancement over previous work in this field. While earlier research established the baseline mechanisms, this new study identifies the specific molecular players involved.

**Building on Previous Work**
The authors explicitly reference how their findings extend beyond the initial observations from previous studies. Where the earlier work noted general patterns, this research pinpoints exact molecular pathways.

**Novel Contributions**
What makes this work particularly exciting is how it fills gaps left by previous research. The team has identified the missing pieces that complete our understanding of this biological system.

**Scientific Impact**
By building upon the solid foundation of previous studies, this research demonstrates the collaborative nature of scientific progress. Each discovery adds another piece to the puzzle.

This is exactly how science should work - standing on the shoulders of giants to see even further."""

    def _generate_general_response(self, prompt: str) -> str:
        """Generate general response"""
        return "This is a mock response generated for testing purposes. In a production environment, this would be replaced with actual LLM output."

# Initialize mock providers
mock_embedding_provider = MockEmbeddingProvider()
mock_llm_provider = MockLLMProvider()

print("✅ Mock providers initialized successfully")
print("   📊 Embedding Provider: Ready (768-dimensional vectors)")
print("   🤖 LLM Provider: Ready (structured + contextual responses)")

✅ Mock providers initialized successfully
   📊 Embedding Provider: Ready (768-dimensional vectors)
   🤖 LLM Provider: Ready (structured + contextual responses)


## Section 2: Mock Data Generation

Create realistic mock scientific articles, embeddings, and training data for testing.

In [4]:
# Generate comprehensive mock scientific articles for testing
class MockDataGenerator:
    """Generate realistic mock data for testing scientific podcast features"""
    
    def __init__(self):
        self.fields = [
            'Neuroscience', 'Cancer Research', 'Immunology', 'Genetics', 
            'Biochemistry', 'Cardiology', 'Infectious Disease', 'Oncology'
        ]
        self.journals = [
            'Nature', 'Science', 'Cell', 'Nature Medicine', 'PNAS',
            'Nature Biotechnology', 'Journal of Clinical Investigation',
            'Nature Neuroscience', 'Cancer Cell', 'Immunity'
        ]
    
    def generate_mock_articles(self, count: int = 50) -> List[Dict]:
        """Generate a dataset of mock scientific articles"""
        articles = []
        
        # Field-specific templates
        field_templates = {
            'Neuroscience': {
                'keywords': ['brain', 'neural', 'neuron', 'synaptic', 'cortex', 'hippocampus', 'plasticity'],
                'methods': ['fMRI', 'electrophysiology', 'optogenetics', 'behavioral testing'],
                'findings': ['increased activity', 'enhanced connectivity', 'improved memory', 'altered signaling']
            },
            'Cancer Research': {
                'keywords': ['tumor', 'cancer', 'oncology', 'metastasis', 'chemotherapy', 'malignant'],
                'methods': ['immunohistochemistry', 'RNA sequencing', 'cell culture', 'xenograft models'],
                'findings': ['reduced tumor growth', 'inhibited metastasis', 'enhanced survival', 'drug resistance']
            },
            'Immunology': {
                'keywords': ['immune', 'antibody', 'T-cell', 'vaccine', 'inflammation', 'cytokine'],
                'methods': ['flow cytometry', 'ELISA', 'immunofluorescence', 'cell sorting'],
                'findings': ['enhanced immune response', 'reduced inflammation', 'improved vaccine efficacy']
            },
            'Genetics': {
                'keywords': ['gene', 'DNA', 'mutation', 'genome', 'CRISPR', 'hereditary'],
                'methods': ['genome sequencing', 'PCR', 'gene editing', 'linkage analysis'],
                'findings': ['identified mutations', 'gene function revealed', 'inheritance patterns']
            },
            'Biochemistry': {
                'keywords': ['protein', 'enzyme', 'molecular', 'pathway', 'metabolism', 'catalysis'],
                'methods': ['protein purification', 'crystallography', 'mass spectrometry'],
                'findings': ['protein structure revealed', 'enzyme activity measured', 'pathway identified']
            }
        }
        
        for i in range(count):
            # Select random field and template
            field = random.choice(list(field_templates.keys()))
            template = field_templates[field]
            
            # Generate realistic article
            title_parts = [
                random.choice(['Novel', 'Advanced', 'Comprehensive', 'Systematic', 'Innovative']),
                random.choice(['mechanisms of', 'role of', 'regulation of', 'analysis of']),
                random.choice(template['keywords']),
                random.choice(['in', 'during', 'following']),
                random.choice(['disease progression', 'cellular response', 'development', 'treatment'])
            ]
            
            title = f"{title_parts[0]} {title_parts[1]} {title_parts[2]} {title_parts[3]} {title_parts[4]}"
            
            # Generate abstract
            abstract_parts = [
                f"This study investigates {random.choice(template['keywords'])} using {random.choice(template['methods'])}.",
                f"We analyzed samples and found {random.choice(template['findings'])}.",
                "Statistical analysis revealed significant differences between experimental groups.",
                f"These findings have important implications for {field.lower()} research.",
                "Our results contribute to the understanding of underlying mechanisms."
            ]
            
            abstract = " ".join(abstract_parts)
            
            # Generate other metadata
            authors = [f"{chr(65 + random.randint(0, 25))}. {random.choice(['Smith', 'Johnson', 'Williams', 'Brown', 'Jones'])}" 
                      for _ in range(random.randint(3, 8))]
            
            article = {
                'title': title,
                'abstract': abstract,
                'authors': authors,
                'journal': random.choice(self.journals),
                'publication_date': f"2024-{random.randint(1,12):02d}-{random.randint(1,28):02d}",
                'doi': f"10.{random.randint(1000,9999)}/{random.randint(100000,999999)}",
                'pmid': str(random.randint(30000000, 39999999)),
                'field': field,  # Ground truth for classification
                'score': random.uniform(0.7, 1.0)
            }
            
            articles.append(article)
        
        return articles

# Generate mock data
data_generator = MockDataGenerator()
mock_articles = data_generator.generate_mock_articles(50)

print(f"✅ Generated {len(mock_articles)} mock scientific articles")
print("\n📊 Data Distribution:")
field_counts = pd.Series([article['field'] for article in mock_articles]).value_counts()
for field, count in field_counts.items():
    print(f"   {field}: {count} articles")

print(f"\n📄 Sample Article:")
sample = mock_articles[0]
print(f"   Title: {sample['title']}")
print(f"   Field: {sample['field']}")
print(f"   Journal: {sample['journal']}")
print(f"   Abstract: {sample['abstract'][:150]}...")

# Save mock data for reference
mock_data_path = test_outputs_dir / 'mock_articles.json'
with open(mock_data_path, 'w') as f:
    json.dump(mock_articles, f, indent=2)
print(f"\n💾 Mock data saved to: {mock_data_path}")

✅ Generated 50 mock scientific articles

📊 Data Distribution:
   Biochemistry: 12 articles
   Neuroscience: 10 articles
   Genetics: 10 articles
   Cancer Research: 10 articles
   Immunology: 8 articles

📄 Sample Article:
   Title: Systematic regulation of protein during development
   Field: Biochemistry
   Journal: Science
   Abstract: This study investigates molecular using mass spectrometry. We analyzed samples and found protein structure revealed. Statistical analysis revealed sig...

💾 Mock data saved to: /home/santi/Projects/UBMI-IFC-Podcast/outputs/advanced_tests/mock_articles.json


In [5]:
# Generate mock embeddings for the articles
async def generate_mock_embeddings_dataset():
    """Generate embeddings for all mock articles"""
    print("🧮 Generating mock embeddings for classification training...")
    
    embeddings_data = []
    
    for i, article in enumerate(mock_articles):
        if i % 10 == 0:
            print(f"   Processing article {i+1}/{len(mock_articles)}")
        
        # Generate embedding for abstract
        embedding = await mock_embedding_provider.generate_embedding(
            article['abstract'], 
            task_type="CLASSIFICATION"
        )
        
        embeddings_data.append({
            'pmid': article['pmid'],
            'field': article['field'],
            'embedding': embedding,
            'title': article['title'][:100]  # Truncated for storage
        })
    
    print(f"✅ Generated embeddings for {len(embeddings_data)} articles")
    print(f"   Embedding dimension: {len(embeddings_data[0]['embedding'])}")
    
    return embeddings_data

# Generate embeddings dataset
embeddings_dataset = await generate_mock_embeddings_dataset()

# Save embeddings dataset
embeddings_path = test_outputs_dir / 'mock_embeddings.json'
with open(embeddings_path, 'w') as f:
    json.dump(embeddings_dataset, f, indent=2)

print(f"💾 Embeddings saved to: {embeddings_path}")
print(f"📊 Dataset size: {len(embeddings_dataset)} articles with {len(embeddings_dataset[0]['embedding'])}-dim embeddings")

INFO:__main__:Generated embedding for text: 'This study investigates molecular using mass spect...' (dim: 768)
INFO:__main__:Generated embedding for text: 'This study investigates T-cell using flow cytometr...' (dim: 768)
INFO:__main__:Generated embedding for text: 'This study investigates pathway using mass spectro...' (dim: 768)
INFO:__main__:Generated embedding for text: 'This study investigates neural using electrophysio...' (dim: 768)
INFO:__main__:Generated embedding for text: 'This study investigates genome using linkage analy...' (dim: 768)
INFO:__main__:Generated embedding for text: 'This study investigates malignant using immunohist...' (dim: 768)
INFO:__main__:Generated embedding for text: 'This study investigates hereditary using linkage a...' (dim: 768)
INFO:__main__:Generated embedding for text: 'This study investigates plasticity using behaviora...' (dim: 768)
INFO:__main__:Generated embedding for text: 'This study investigates neuron using optogenetics....' (dim: 768)

🧮 Generating mock embeddings for classification training...
   Processing article 1/50
   Processing article 11/50
   Processing article 21/50
   Processing article 31/50
   Processing article 41/50
✅ Generated embeddings for 50 articles
   Embedding dimension: 768
💾 Embeddings saved to: /home/santi/Projects/UBMI-IFC-Podcast/outputs/advanced_tests/mock_embeddings.json
📊 Dataset size: 50 articles with 768-dim embeddings


## Section 3: Scientific Field Classifier Testing

Test the AI-powered scientific field classifier using mock embeddings and a scikit-learn logistic regression model trained inside the notebook.

In [7]:
# Scientific Field Classifier Implementation
class MockScientificFieldClassifier:
    """
    Scientific field classifier using embeddings for automatic categorization.
    Uses mock embeddings and scikit-learn for testing without paid APIs.
    """
    
    def __init__(self, embedding_provider):
        self.model = LogisticRegression(
            class_weight='balanced', 
            max_iter=1000,
            random_state=42
        )
        self.scaler = StandardScaler()
        self.embedder = embedding_provider
        self.is_trained = False
        self.class_names = []
        self.label_to_class = {}
        self.class_to_label = {}
        print("✅ MockScientificFieldClassifier initialized")

    def prepare_training_data(self, embeddings_data: List[Dict]) -> tuple:
        """Prepare training data from embeddings dataset"""
        print(f"📊 Preparing training data from {len(embeddings_data)} samples...")
        
        # Extract features and labels
        X = np.array([item['embedding'] for item in embeddings_data])
        y_raw = [item['field'] for item in embeddings_data]
        
        # Create label mappings
        unique_fields = sorted(list(set(y_raw)))
        self.class_names = unique_fields
        self.label_to_class = {i: field for i, field in enumerate(unique_fields)}
        self.class_to_label = {field: i for i, field in enumerate(unique_fields)}
        
        # Convert to numeric labels
        y = np.array([self.class_to_label[field] for field in y_raw])
        
        print(f"   Features shape: {X.shape}")
        print(f"   Found {len(unique_fields)} unique fields: {unique_fields}")
        
        return X, y

    def train(self, embeddings_data: List[Dict]):
        """Train the classifier on embeddings data"""
        print("🧠 Training scientific field classifier...")
        
        X, y = self.prepare_training_data(embeddings_data)
        
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )
        
        # Scale features
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)
        
        # Train model
        self.model.fit(X_train_scaled, y_train)
        
        # Evaluate
        train_score = self.model.score(X_train_scaled, y_train)
        test_score = self.model.score(X_test_scaled, y_test)
        
        print(f"   Training accuracy: {train_score:.3f}")
        print(f"   Test accuracy: {test_score:.3f}")
        
        # Detailed classification report
        y_pred = self.model.predict(X_test_scaled)
        class_names_for_report = [self.label_to_class[i] for i in range(len(self.class_names))]
        
        print("\n📊 Classification Report:")
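        report = classification_report(
            y_test, y_pred, 
            # Pass labels explicitly so rows stay aligned with target_names
            # even if a class is missing from the test split
            labels=list(range(len(self.class_names))),
            target_names=class_names_for_report,
            zero_division=0
        )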
        print(report)
        
        self.is_trained = True
        return train_score, test_score

    async def predict(self, article: Dict) -> tuple:
        """Predict the scientific field for a new article"""
        if not self.is_trained:
            raise RuntimeError("Classifier must be trained before making predictions")
        
        # Generate embedding
        embedding = await self.embedder.generate_embedding(
            article['abstract'], 
            task_type="CLASSIFICATION"
        )
        
        # Scale and predict
        X = np.array([embedding])
        X_scaled = self.scaler.transform(X)
        
        prediction_label = self.model.predict(X_scaled)[0]
        prediction_proba = self.model.predict_proba(X_scaled)[0]
        
        predicted_field = self.label_to_class[prediction_label]
        confidence = float(prediction_proba[prediction_label])
        
        # Get top 3 predictions with probabilities
        top_indices = np.argsort(prediction_proba)[::-1][:3]
        top_predictions = [
            (self.label_to_class[idx], float(prediction_proba[idx]))
            for idx in top_indices
        ]
        
        return predicted_field, confidence, top_predictions

# Initialize and test the classifier
classifier = MockScientificFieldClassifier(mock_embedding_provider)

# Train the classifier
train_acc, test_acc = classifier.train(embeddings_dataset)

print(f"\n🎯 Classifier Performance Summary:")
print(f"   Training Accuracy: {train_acc:.1%}")
print(f"   Test Accuracy: {test_acc:.1%}")
print(f"   Status: {'✅ Ready for production' if test_acc > 0.7 else '⚠️ Needs improvement'}")

✅ MockScientificFieldClassifier initialized
🧠 Training scientific field classifier...
📊 Preparing training data from 50 samples...
   Features shape: (50, 768)
   Found 5 unique fields: ['Biochemistry', 'Cancer Research', 'Genetics', 'Immunology', 'Neuroscience']
   Training accuracy: 1.000
   Test accuracy: 0.400

📊 Classification Report:
                 precision    recall  f1-score   support

   Biochemistry       0.00      0.00      0.00         2
Cancer Research       0.50      1.00      0.67         2
       Genetics       1.00      0.50      0.67         2
     Immunology       0.00      0.00      0.00         2
   Neuroscience       0.25      0.50      0.33         2

       accuracy                           0.40        10
      macro avg       0.35      0.40      0.33        10
   weighted avg       0.35      0.40      0.33        10


🎯 Classifier Performance Summary:
   Training Accuracy: 100.0%
   Test Accuracy: 40.0%
   Status: ⚠️ Needs improvement


> 👍 Success
>
> This entire process is a simulation of a machine learning workflow.
> The goal is to teach a simple AI model to associate the "meaning" of an article's abstract with a specific scientific field.


*From Words to Recognizable Patterns*

An AI model like LogisticRegression cannot understand words directly. It only understands numbers. The entire process is about converting the text of an article's abstract into a numerical fingerprint (an "embedding") and then teaching the model to recognize the fingerprints of different scientific fields.

### Step 1: Generating Realistic (But Fake) Scientific Data

- *What it is*: The MockDataGenerator class creates a dataset of 50 fake scientific articles.
- *How it works*: It uses templates with field-specific keywords (e.g., 'Neuroscience' gets 'brain', 'neural', 'synaptic'). It then randomly combines these to generate realistic-sounding titles and abstracts.
- Each fake article is given a "ground truth" label. For example:

```
# from MockDataGenerator
article = {
    'title': 'Novel mechanisms of neural plasticity...',
    'abstract': 'This study investigates brain activity...',
    'field': 'Neuroscience',  # <-- This is the correct answer key
}
```
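
The template-based generation can be reduced to a few lines. The sketch below uses a trimmed-down, hypothetical template table (`FIELD_TEMPLATES`) and a private `random.Random` instance so that the same seed always produces the same dataset; the notebook's generator uses the global `random` module without a seed, so its output changes from run to run.

```python
import random

# Hypothetical, trimmed-down templates; the notebook's generator has more fields and keys.
FIELD_TEMPLATES = {
    "Neuroscience": {"keywords": ["brain", "neural"], "methods": ["fMRI", "optogenetics"]},
    "Genetics": {"keywords": ["gene", "DNA"], "methods": ["PCR", "genome sequencing"]},
}


def make_article(rng: random.Random) -> dict:
    """Build one labeled mock article from a randomly chosen field template."""
    field = rng.choice(sorted(FIELD_TEMPLATES))
    template = FIELD_TEMPLATES[field]
    keyword = rng.choice(template["keywords"])
    method = rng.choice(template["methods"])
    return {
        "title": f"Novel role of {keyword} in development",
        "abstract": f"This study investigates {keyword} using {method}.",
        "field": field,  # ground-truth label used later for training
    }


def make_dataset(count: int, seed: int = 0) -> list:
    """Generate `count` articles; the same seed reproduces the same list."""
    rng = random.Random(seed)
    return [make_article(rng) for _ in range(count)]


articles = make_dataset(5, seed=42)
```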


### Step 2: Mock Embeddings (AI input)

- What it is: The MockEmbeddingProvider converts an article's abstract into a 768-dimensional numerical vector (the "embedding").
- How it simulates meaning:
  - Base Vector: It seeds NumPy's random generator from the characters of the abstract, then draws a base vector of 768 small random numbers. The same text therefore always yields the same embedding.
  - Keyword Detection: It splits the lowercased abstract on whitespace and looks each word up in a 12-entry keyword table (e.g., 'brain', 'cancer', 'immune'). The table covers only part of the data generator's vocabulary, and a keyword followed directly by punctuation is not matched.
  - Injecting a "Semantic Signature": For each keyword found, it adds a 50-value pattern to the first 50 dimensions of the base vector. These patterns are re-drawn from the text-seeded generator on every call, and every keyword writes to the same 50 dimensions, so the signature is not a fixed, field-specific pattern.
  - Result: The final embedding is normalized to unit length. The intent is to mimic a real embedding model, where semantically similar texts produce vectors that are "close" to each other in multi-dimensional space. As written, the mock only approximates this, which is consistent with the low test accuracy recorded above.
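
One way to get field-specific fingerprints is to give every keyword a fixed signature vector that is created once and reused for every text. The following is a minimal sketch of that idea, not the notebook's implementation: `KEYWORDS`, `SIGNATURES`, `_stable_seed`, and `mock_embedding` are illustrative names, and the keyword list is deliberately short.

```python
import hashlib
import re

import numpy as np

EMBEDDING_DIM = 768
KEYWORDS = ["brain", "neural", "cancer", "tumor", "immune", "vaccine", "gene", "dna"]


def _stable_seed(text: str) -> int:
    """Derive a seed from a SHA-256 digest so it is identical across processes."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")


# One fixed signature per keyword, created once and shared by all texts.
SIGNATURES = {
    kw: np.random.default_rng(_stable_seed(kw)).normal(0.0, 1.0, EMBEDDING_DIM)
    for kw in KEYWORDS
}


def mock_embedding(text: str) -> np.ndarray:
    """Return a unit-length vector: small text-specific noise plus keyword signatures."""
    rng = np.random.default_rng(_stable_seed(text))
    vec = rng.normal(0.0, 0.1, EMBEDDING_DIM)
    # The regex tokenizer strips punctuation, so "brain." still matches "brain".
    for token in re.findall(r"[a-z0-9-]+", text.lower()):
        if token in SIGNATURES:
            vec += SIGNATURES[token]
    return vec / np.linalg.norm(vec)


a = mock_embedding("Neural activity in the brain.")
b = mock_embedding("Brain imaging of neural circuits.")
c = mock_embedding("A vaccine boosts the immune response.")
print(f"same topic: {a @ b:.2f}, different topic: {a @ c:.2f}")
```

Because the vectors have unit length, the dot product is the cosine similarity: texts sharing keywords score close to 1, while texts with unrelated keywords score near 0.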

### Step 3: The AI Model - LogisticRegression

- What it is: This is the "AI brain" you are training. It's a classic, efficient linear model for classification tasks.
- What it does: Its job is to find a mathematical formula that separates the different groups of numerical fingerprints. Think of it as learning to draw boundaries in a high-dimensional space to cordon off the "Neuroscience" embeddings from the "Cancer Research" embeddings.
- Why it's used here: It's fast to train and works well when the classes are reasonably separable, which the mock embeddings are intended to be. In the recorded run, however, it reached 100% training accuracy but only 40% accuracy on the 10 held-out articles. That gap is a sign of overfitting: there are 768 features but only 40 training examples.
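
To make the "drawing boundaries" picture concrete, here is a small sketch on synthetic data rather than the notebook's embeddings. It assumes two made-up "fields" represented by 2-D clusters and fits a `LogisticRegression` to separate them; all numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two synthetic "fields": 2-D clusters centered at (-2, -2) and (+2, +2).
X = np.vstack([
    rng.normal(-2.0, 0.5, (50, 2)),
    rng.normal(2.0, 0.5, (50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression(max_iter=1000).fit(X, y)

# For two classes, the learned boundary is the line w . x + b = 0.
w, b = model.coef_[0], model.intercept_[0]

# Class probabilities at the two cluster centers.
probs = model.predict_proba(np.array([[-2.0, -2.0], [2.0, 2.0]]))
print("weights:", w, "intercept:", b)
print("probabilities at the centers:", probs)
```

With more than two classes, scikit-learn fits one weight vector per class, so the single line becomes a set of boundaries between regions.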

### Step 4: The Training Process (classifier.train)

This is where the learning happens.

- Prepare Data: The code takes the dataset of 50 articles and their corresponding mock embeddings. It now has the input (X = the 768-dimensional embeddings) and the correct answer for each (y = the 'field' label, converted to a number). Fields are numbered in alphabetical order, so Biochemistry is 0, Cancer Research is 1, and so on up to Neuroscience at 4.
- Split the Data: It splits the data into a training set (80%, 40 articles) and a testing set (20%, 10 articles) using train_test_split, stratified by field so that every field appears in both sets.
  - Training Set: The model gets to see these examples and their correct answers to learn the patterns.
  - Testing Set: This data is held back. The model has never seen it. We use it at the end to see how well the model performs on new, unseen data.
- Scaling: The StandardScaler standardizes each feature to zero mean and unit variance. It is fitted on the training set only and then applied to the testing set, so no information from the test data leaks into training.
- Learning (`model.fit`): This is the core training command. The model analyzes the training embeddings (X_train_scaled) and their labels (y_train) and adjusts its internal formula to best separate the different fields.
- Evaluation: After training, the code reports accuracy on both the training set and the unseen testing set. It shows the model the test embeddings and asks it to predict the field. It then compares the model's predictions to the correct answers (y_test) and calculates the accuracy. The classification_report gives you a detailed breakdown of its performance for each scientific field.
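
With only 10 test articles, a single split gives a noisy accuracy estimate. A common alternative, shown here as a sketch and not as what the notebook does, is to wrap the scaler and the model in a scikit-learn pipeline and score it with stratified k-fold cross-validation. The data below is synthetic stand-in data with the same shape as the mock embeddings (50 samples, 5 classes, 768 features).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in data: each class gets its own random center plus per-sample noise.
y = np.repeat(np.arange(5), 10)
centers = rng.normal(0.0, 1.0, (5, 768))
X = centers[y] + rng.normal(0.0, 1.0, (50, 768))

# The pipeline refits the scaler inside every fold, so the held-out fold
# never influences the scaling statistics.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv)
print(f"Fold accuracies: {scores}, mean: {scores.mean():.3f}")
```

Every article is used for testing exactly once, so the mean across folds depends less on which 10 articles happened to land in the test set.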

### Step 5: Making a Prediction on a New Article (classifier.predict)

Once trained, the classifier is ready to be used.

- A new, unseen article abstract is provided.
- It's passed to the MockEmbeddingProvider to get its 768-dimensional numerical fingerprint.
- This new fingerprint is scaled using the same scaler that was fitted during training.
- The scaled fingerprint is fed to the trained model.
- The model applies its learned formula and outputs the most likely scientific field (e.g., "Immunology"), a confidence score (the predicted probability of that field), and the three most probable fields.
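
The ranking step at the end of prediction can be isolated as a small helper. This sketch uses a hypothetical probability row in place of a real `predict_proba` result; `top_k_predictions` is an illustrative name, and `class_labels` is assumed to follow the column order of `predict_proba` (in scikit-learn, the order of `model.classes_`).

```python
import numpy as np


def top_k_predictions(probabilities, class_labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probabilities = np.asarray(probabilities, dtype=float)
    order = np.argsort(probabilities)[::-1][:k]  # indices sorted by descending probability
    return [(class_labels[i], float(probabilities[i])) for i in order]


# Hypothetical values standing in for model.predict_proba(X_scaled)[0].
fields = ["Biochemistry", "Cancer Research", "Genetics", "Immunology", "Neuroscience"]
row = [0.05, 0.10, 0.15, 0.58, 0.12]

top = top_k_predictions(row, fields)
predicted_field, confidence = top[0]
print(predicted_field, confidence)  # Immunology 0.58
```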


In [12]:
# Test classifier with new articles
async def test_classifier_predictions():
    """Test the trained classifier on new articles"""
    print("🔍 Testing classifier predictions on new articles...")
    
    # Create test articles from different fields
    test_cases = [
        {
            'title': 'Deep Brain Stimulation Effects on Parkinson Disease Motor Symptoms',
            'abstract': 'We investigated the effects of deep brain stimulation on motor symptoms in Parkinson disease patients. Using neuroimaging and clinical assessments, we found significant improvements in motor function following DBS treatment. Neural activity patterns showed increased coherence in motor circuits.',
            'expected_field': 'Neuroscience'
        },
        {
            'title': 'CRISPR-Cas9 Mediated Gene Editing in Cancer Cell Lines',
            'abstract': 'This study employed CRISPR-Cas9 gene editing technology to target oncogenes in multiple cancer cell lines. We observed significant tumor growth inhibition and reduced metastatic potential following gene knockout. Molecular analysis revealed disrupted signaling pathways.',
            'expected_field': 'Cancer Research'
        },
        {
            'title': 'Novel Vaccine Adjuvants Enhance Antibody Response',
            'abstract': 'We developed new vaccine adjuvants to improve immune response to viral antigens. Flow cytometry analysis showed enhanced T-cell activation and increased antibody production. The adjuvanted vaccines provided superior protection in animal models.',
            'expected_field': 'Immunology'
        }
    ]
    
    results = []
    
    for i, test_case in enumerate(test_cases):
        print(f"\n🧪 Test Case {i+1}: {test_case['title'][:50]}...")
        
        predicted_field, confidence, top_predictions = await classifier.predict(test_case)
        
        is_correct = predicted_field == test_case['expected_field']
        
        print(f"   Expected: {test_case['expected_field']}")
        print(f"   Predicted: {predicted_field} (confidence: {confidence:.2%})")
        print(f"   Result: {'✅ Correct' if is_correct else '❌ Incorrect'}")
        
        print(f"   Top 3 predictions:")
        for rank, (field, prob) in enumerate(top_predictions, 1):
            print(f"      {rank}. {field}: {prob:.2%}")
        
        results.append({
            'test_case': i+1,
            'title': test_case['title'],
            'expected': test_case['expected_field'],
            'predicted': predicted_field,
            'confidence': confidence,
            'correct': is_correct,
            'top_predictions': top_predictions
        })
    
    # Calculate accuracy
    accuracy = sum(r['correct'] for r in results) / len(results)
    avg_confidence = sum(r['confidence'] for r in results) / len(results)
    
    print(f"\n📊 Classification Test Results:")
    print(f"   Accuracy: {accuracy:.1%} ({sum(r['correct'] for r in results)}/{len(results)})")
    print(f"   Average Confidence: {avg_confidence:.1%}")
    print(f"   Status: {'✅ Excellent' if accuracy >= 0.8 else '👍 Good' if accuracy >= 0.6 else '⚠️ Needs improvement'}")
    
    return results

# Run classifier tests
classification_results = await test_classifier_predictions()

# Save results
results_path = test_outputs_dir / 'classification_results.json'
with open(results_path, 'w') as f:
    json.dump(classification_results, f, indent=2, default=str)

print(f"\n💾 Classification results saved to: {results_path}")

INFO:__main__:Generated embedding for text: 'We investigated the effects of deep brain stimulat...' (dim: 768)
INFO:__main__:Generated embedding for text: 'This study employed CRISPR-Cas9 gene editing techn...' (dim: 768)
INFO:__main__:Generated embedding for text: 'We developed new vaccine adjuvants to improve immu...' (dim: 768)


🔍 Testing classifier predictions on new articles...

🧪 Test Case 1: Deep Brain Stimulation Effects on Parkinson Diseas...
   Expected: Neuroscience
   Predicted: Cancer Research (confidence: 42.99%)
   Result: ❌ Incorrect
   Top 3 predictions:
      1. Cancer Research: 42.99%
      2. Neuroscience: 33.72%
      3. Genetics: 11.55%

🧪 Test Case 2: CRISPR-Cas9 Mediated Gene Editing in Cancer Cell L...
   Expected: Cancer Research
   Predicted: Cancer Research (confidence: 45.09%)
   Result: ✅ Correct
   Top 3 predictions:
      1. Cancer Research: 45.09%
      2. Neuroscience: 38.46%
      3. Genetics: 6.82%

🧪 Test Case 3: Novel Vaccine Adjuvants Enhance Antibody Response...
   Expected: Immunology
   Predicted: Cancer Research (confidence: 42.77%)
   Result: ❌ Incorrect
   Top 3 predictions:
      1. Cancer Research: 42.77%
      2. Neuroscience: 38.97%
      3. Genetics: 7.88%

📊 Classification Test Results:
   Accuracy: 33.3% (1/3)
   Average Confidence: 43.6%
   Status: ⚠️ Needs imp

## Section 4: Structured Script Generation Testing

Test Pydantic-based structured script generation using mock API responses.
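
As a minimal sketch of what schema validation provides here, assuming Pydantic v2 is installed: the `MiniScript` model and both JSON payloads below are hypothetical and much smaller than the schema defined in the next cell.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class MiniScript(BaseModel):
    """Hypothetical two-field schema used only for this illustration."""

    podcast_title: str = Field(min_length=10, max_length=100)
    key_findings: List[str] = Field(min_length=2, max_length=4)


good_json = '{"podcast_title": "How Brain Cells Adapt", "key_findings": ["A", "B"]}'
bad_json = '{"podcast_title": "Too short", "key_findings": ["Only one"]}'

# A payload that satisfies every constraint is parsed into a typed object.
script = MiniScript.model_validate_json(good_json)
print("Valid:", script.podcast_title)

try:
    MiniScript.model_validate_json(bad_json)
except ValidationError as exc:
    # Each error entry names the offending field in its "loc" tuple.
    failed_fields = sorted({err["loc"][0] for err in exc.errors()})
    print("Rejected fields:", failed_fields)
```

Catching `ValidationError` at this boundary means a malformed LLM response fails loudly instead of producing a script with missing sections.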

In [14]:
# Pydantic models for structured script generation
from pydantic import BaseModel, ConfigDict, Field

class PodcastScriptStructure(BaseModel):
    """Structured output schema for scientific podcast scripts"""
    model_config = ConfigDict(
        json_schema_extra={
            "example": {
                "podcast_title": "Revolutionary Cancer Treatment Shows Promise in Clinical Trials",
                "introduction": "Today we explore breakthrough research that could change cancer treatment...",
                "methods_summary": "Researchers used advanced immunotherapy techniques...",
                "key_findings": ["Finding 1", "Finding 2", "Finding 3"],
                "implications_and_significance": "These results suggest new therapeutic approaches...",
                "conclusion": "This research opens new doors for cancer patients worldwide."
            }
        }
    )
    
    podcast_title: str = Field(
        description="Engaging, accessible title for the podcast episode",
        min_length=10,
        max_length=100
    )
    introduction: str = Field(
        description="Hook to grab listener attention, introducing the research topic and importance",
        min_length=50,
        max_length=500
    )
    methods_summary: str = Field(
        description="Simplified explanation of key research methods, avoiding jargon",
        min_length=30,
        max_length=300
    )
    key_findings: List[str] = Field(
        description="List of 2-4 main results or discoveries, explained clearly",
        # Pydantic v2 uses min_length/max_length for collections; min_items/max_items are deprecated
        min_length=2,
        max_length=4
    )
    implications_and_significance: str = Field(
        description="Why these findings matter for science and the public",
        min_length=50,
        max_length=400
    )
    conclusion: str = Field(
        description="Summary and concluding thought to leave listeners with",
        min_length=30,
        max_length=200
    )

class StructuredScriptGenerator:
    """Generator for structured podcast scripts using Pydantic validation"""
    
    def __init__(self, llm_provider):
        self.llm_provider = llm_provider
        
    async def generate_structured_script(self, article: Dict) -> PodcastScriptStructure:
        """Generate a structured podcast script from an article"""
        
        prompt = f"""
        Generate a structured podcast script for the following scientific article.
        Make it accessible to a general audience while maintaining scientific accuracy.
        
        Article Title: {article.get('title', 'Unknown')}
        Journal: {article.get('journal', 'Unknown')}
        Field: {article.get('field', 'Unknown')}
        
        Abstract: {article.get('abstract', 'No abstract available')}
        
        Return the response as JSON matching the PodcastScriptStructure schema.
        Focus on clear explanations and engaging presentation suitable for audio format.
        """
        
        # Get JSON response from mock LLM
        response = await self.llm_provider.generate_response(
            prompt, 
            response_format="json"
        )
        
        # Parse and validate with Pydantic
        script_data = json.loads(response)
        structured_script = PodcastScriptStructure.model_validate(script_data)
        
        return structured_script
    
    def script_to_text(self, structured_script: PodcastScriptStructure) -> str:
        """Convert structured script to readable text format"""
        text_parts = [
            f"# {structured_script.podcast_title}\n",
            "## Introduction",
            structured_script.introduction,
            "\n## Methods",
            structured_script.methods_summary,
            "\n## Key Findings",
        ]
        
        for i, finding in enumerate(structured_script.key_findings, 1):
            text_parts.append(f"{i}. {finding}")
        
        text_parts.extend([
            "\n## Implications and Significance",
            structured_script.implications_and_significance,
            "\n## Conclusion",
            structured_script.conclusion
        ])
        
        return "\n".join(text_parts)

# Test structured script generation
script_generator = StructuredScriptGenerator(mock_llm_provider)

print("📝 Testing Structured Script Generation...")
print("=" * 60)

# Test with multiple articles
test_articles = mock_articles[:3]  # Use first 3 articles for testing

structured_results = []

for i, article in enumerate(test_articles):
    print(f"\n🧪 Test {i+1}: Generating script for {article['field']} research...")
    print(f"   Article: {article['title'][:60]}...")
    
    try:
        # Generate structured script
        structured_script = await script_generator.generate_structured_script(article)
        
        print("✅ Structured script generated successfully!")
        print(f"   Title: {structured_script.podcast_title}")
        print(f"   Introduction: {structured_script.introduction[:100]}...")
        print(f"   Key findings: {len(structured_script.key_findings)} items")
        
        # Validate structure
        validation_results = {
            'title_length': len(structured_script.podcast_title),
            'intro_length': len(structured_script.introduction),
            'findings_count': len(structured_script.key_findings),
            'has_conclusion': len(structured_script.conclusion) > 0
        }
        
        print(f"   Validation: Title({validation_results['title_length']} chars), "
              f"Findings({validation_results['findings_count']} items) ✅")
        
        # Convert to readable format
        full_text = script_generator.script_to_text(structured_script)
        
        # Save individual script
        script_filename = f"structured_script_{i+1}_{article['field'].lower().replace(' ', '_')}.md"
        script_path = test_outputs_dir / script_filename
        
        with open(script_path, 'w', encoding='utf-8') as f:
            f.write(full_text)
        
        print(f"   💾 Saved to: {script_filename}")
        
        structured_results.append({
            'article_id': i+1,
            'field': article['field'],
            'title': article['title'],
            'structured_script': structured_script.model_dump(),
            'validation': validation_results,
            'full_text_length': len(full_text),
            'status': 'success'
        })
        
    except Exception as e:
        print(f"❌ Script generation failed: {e}")
        structured_results.append({
            'article_id': i+1,
            'field': article['field'],
            'title': article['title'],
            'error': str(e),
            'status': 'error'
        })

# Summary of structured generation results
successful_scripts = [r for r in structured_results if r['status'] == 'success']
successful_generations = len(successful_scripts)
# Guard against empty result lists so the summary never divides by zero
success_rate = successful_generations / len(structured_results) if structured_results else 0.0
avg_script_length = np.mean([r['full_text_length'] for r in successful_scripts]) if successful_scripts else 0.0

print(f"\n📊 Structured Script Generation Summary:")
print(f"   Success Rate: {success_rate:.1%} ({successful_generations}/{len(structured_results)})")
print(f"   Average Script Length: {avg_script_length:.0f} characters")
print(f"   Status: {'✅ Excellent' if success_rate >= 0.8 else '👍 Good' if success_rate >= 0.6 else '⚠️ Needs work'}")

# Save structured results
structured_results_path = test_outputs_dir / 'structured_script_results.json'
with open(structured_results_path, 'w') as f:
    json.dump(structured_results, f, indent=2)

print(f"\n💾 Structured script results saved to: {structured_results_path}")

📝 Testing Structured Script Generation...

🧪 Test 1: Generating script for Biochemistry research...
   Article: Systematic regulation of protein during development...
✅ Structured script generated successfully!
   Title: Breakthrough in Neural Network Plasticity: How Brain Cells Adapt to New Information
   Introduction: Have you ever wondered how your brain learns and adapts throughout your life? Today we're diving int...
   Key findings: 3 items
   Validation: Title(83 chars), Findings(3 items) ✅
   💾 Saved to: structured_script_1_biochemistry.md

🧪 Test 2: Generating script for Immunology research...
   Article: Advanced mechanisms of immune during disease progression...
✅ Structured script generated successfully!
   Title: Breakthrough in Neural Network Plasticity: How Brain Cells Adapt to New Information
   Introduction: Have you ever wondered how your brain learns and adapts throughout your life? Today we're diving int...
   Key findings: 3 items
   Validation: Title(83 chars), Fi

## Section 5: Multi-Modal RAG Context Testing

> 👍 **Goal**: To generate a podcast script that is not just a summary of one article, but a narrative that places the new research into the broader context of its scientific field. This is achieved by first retrieving relevant background documents and then using them to inform the script generation, a process known as Retrieval-Augmented Generation (RAG).

This section tests the system's ability to create richer, more insightful content by understanding how a new piece of research connects to previous work.

### What is Retrieval-Augmented Generation (RAG)?

RAG is an AI technique that enhances a Large Language Model's (LLM) ability to generate accurate and context-aware responses. Instead of relying solely on its pre-trained knowledge, the model first **retrieves** relevant information from an external, up-to-date knowledge base. This retrieved information is then provided to the model as context along with the user's prompt to **generate** a more informed output.

**In our case:**
1.  **Retrieval**: Find existing research papers, background articles, or methodology documents that are related to the new scientific article we want to summarize.
2.  **Augmentation**: Feed both the new article and the retrieved documents into the LLM.
3.  **Generation**: Ask the LLM to create a podcast script that explains the new article *in the context of* the retrieved documents.
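
The three stages above can be sketched with the standard library alone. Everything here is an illustrative assumption: the word-overlap `retrieve` function is a toy, `echo_llm` is a hypothetical stand-in for a model call, and the two-document knowledge base is invented.

```python
from typing import Callable, Dict, List


def retrieve(query: str, knowledge_base: List[Dict[str, str]], top_k: int = 2) -> List[Dict[str, str]]:
    """Retrieval: rank documents by how many query words they share (a toy retriever)."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(doc["content"].lower().split())), doc)
        for doc in knowledge_base
    ]
    scored = [item for item in scored if item[0] > 0]  # drop documents with no overlap
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def augment(article: Dict[str, str], contexts: List[Dict[str, str]]) -> str:
    """Augmentation: place the retrieved documents next to the new article in one prompt."""
    lines = [f"CURRENT ARTICLE: {article['title']}", article["abstract"], "", "RELEVANT CONTEXT:"]
    for i, doc in enumerate(contexts, 1):
        lines.append(f"{i}. {doc['title']}: {doc['content']}")
    return "\n".join(lines)


def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Generation: hand the augmented prompt to whatever model callable is supplied."""
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it only reports the prompt size."""
    return f"[script drafted from a {len(prompt)}-character prompt]"


knowledge_base = [
    {"title": "Synaptic proteins", "content": "BDNF regulates synaptic plasticity in neurons"},
    {"title": "Tumor metabolism", "content": "cancer cells rewire metabolism under chemotherapy"},
]
article = {"title": "Plasticity after learning", "abstract": "We measure synaptic plasticity after training"}

contexts = retrieve(article["abstract"], knowledge_base)
prompt = augment(article, contexts)
print(generate(prompt, echo_llm))
```

Keeping the three stages as separate functions makes it straightforward to swap the toy retriever for an embedding-based search without touching prompt construction.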

### The "Multi-Modal" Aspect in This Test

While "multi-modal" often refers to different data types (text, image, audio), here it refers to retrieving different **modalities of information** from a textual knowledge base. The system doesn't just find similar articles; it looks for different *types* of context, such as:

-   **Previous Research**: Foundational studies that the new article builds upon.
-   **Background Information**: General knowledge that helps explain the importance of the field.
-   **Methodology Documents**: Papers that explain the techniques used in the new research.

By combining these different informational modes, the AI can construct a more complete and compelling narrative.
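
One way to make sure each informational mode is represented is to keep the best-scoring document per type. The sketch below is an assumption about how that could be done, not what the notebook's retriever does (it returns the top results regardless of type); `Doc` and `best_per_type` are hypothetical names, and the titles and scores are borrowed from the mock database for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical, trimmed-down context record."""
    title: str
    doc_type: str  # 'previous_research', 'background', or 'methodology'
    relevance_score: float


def best_per_type(docs: List[Doc]) -> Dict[str, Doc]:
    """Keep the highest-scoring document for each doc_type."""
    best: Dict[str, Doc] = {}
    for doc in docs:
        current = best.get(doc.doc_type)
        if current is None or doc.relevance_score > current.relevance_score:
            best[doc.doc_type] = doc
    return best


docs = [
    Doc("Foundational Study on Neural Plasticity Mechanisms", "previous_research", 0.90),
    Doc("Methodological Advances in Protein Analysis", "methodology", 0.80),
    Doc("Clinical Implications of Synaptic Dysfunction", "background", 0.85),
    Doc("Cancer Cell Metabolism and Treatment Resistance", "previous_research", 0.88),
]

selection = best_per_type(docs)
for doc_type, doc in selection.items():
    print(f"{doc_type}: {doc.title} ({doc.relevance_score:.2f})")
```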

### How the Test Works

1.  **Mock Context Database (`MockContextRetriever`)**: We simulate a knowledge base with a few pre-written documents representing "previous research," "background," and "methodology."

2.  **Context Retrieval (`find_relevant_context`)**: When a new article is processed, this retriever scans the mock database and scores each document by field and keyword overlap. This keyword matching stands in for the similarity search a real-world vector database would perform.

3.  **Context-Aware Generation (`ContextAwareScriptGenerator`)**:
    -   It takes the new article and the list of retrieved context documents.
    -   It constructs a detailed prompt for the LLM, explicitly instructing it to use the provided context to highlight how the new research extends, confirms, or challenges previous findings.
    -   The mock LLM then generates a script that (in theory) weaves these elements together, demonstrating a deeper understanding of the scientific progression.

The ultimate test is to see if the final script successfully tells a story about science in motion, rather than just describing a single, isolated discovery.
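
To make the retrieval scoring concrete, the sketch below mirrors the three-term relevance formula used by `find_relevant_context` in the next cell (0.4 for a field-name match, 0.3 weighted by keyword hits, 0.3 weighted by shared words divided by 100) and applies it to one mock context. The function name `relevance`, the added guard for an empty field name, and the short article text are illustrative assumptions.

```python
from typing import List


def relevance(field: str, article_text: str, context_text: str, keywords: List[str]) -> float:
    """Score one context document with the same weights as the mock retriever."""
    context_lower = context_text.lower()
    score = 0.0
    # Term 1: the field name itself appears in the context text.
    if field and field in context_lower:
        score += 0.4
    # Term 2: fraction of the field's keywords found in the context text.
    if keywords:
        hits = sum(1 for kw in keywords if kw in context_lower)
        score += 0.3 * (hits / len(keywords))
    # Term 3: shared whitespace-separated words, scaled by 100.
    shared = set(article_text.lower().split()) & set(context_lower.split())
    score += 0.3 * (len(shared) / 100)
    return score


context_text = (
    "Previous research in our laboratory established that neural plasticity involves "
    "complex molecular cascades. We identified key proteins including CREB and BDNF that "
    "regulate synaptic strength. However, the temporal dynamics and regional specificity "
    "remained unclear."
)
keywords = ["neural", "brain", "synap", "neuron"]
article_text = "regulation of neural activity in development"  # illustrative text

score = relevance("neuroscience", article_text, context_text, keywords)
# Two of four keywords match ('neural', 'synap'), so the keyword term contributes 0.15;
# with no field-name match, the total stays under the 0.2 cutoff.
print(f"score = {score:.3f}, passes 0.2 threshold: {score > 0.2}")
```

With these weights, a document that never mentions the field name needs roughly three of the four keywords to clear the 0.2 threshold unless the texts share many words, which may explain runs in which no contexts are retrieved.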

In [15]:
# Multi-Modal RAG Context System
@dataclass
class ContextDocument:
    """Represents a context document for RAG"""
    title: str
    content: str
    doc_type: str  # 'previous_research', 'background', 'methodology'
    relevance_score: float
    metadata: Dict

class MockContextRetriever:
    """Mock context retrieval system for RAG testing"""
    
    def __init__(self):
        self.context_db = self._build_mock_context_database()
    
    def _build_mock_context_database(self) -> List[ContextDocument]:
        """Build a database of mock context documents"""
        contexts = [
            ContextDocument(
                title="Foundational Study on Neural Plasticity Mechanisms",
                content="Previous research in our laboratory established that neural plasticity involves complex molecular cascades. We identified key proteins including CREB and BDNF that regulate synaptic strength. However, the temporal dynamics and regional specificity remained unclear.",
                doc_type="previous_research",
                relevance_score=0.9,
                metadata={'year': 2022, 'citations': 156}
            ),
            ContextDocument(
                title="Methodological Advances in Protein Analysis",
                content="Recent developments in proteomics have enabled precise quantification of synaptic proteins. Mass spectrometry coupled with fluorescence microscopy provides unprecedented resolution for studying protein-protein interactions in live cells.",
                doc_type="methodology",
                relevance_score=0.8,
                metadata={'year': 2023, 'citations': 89}
            ),
            ContextDocument(
                title="Clinical Implications of Synaptic Dysfunction",
                content="Synaptic dysfunction underlies numerous neurological disorders including Alzheimer's disease and schizophrenia. Understanding molecular mechanisms of synaptic plasticity is crucial for developing targeted therapeutics.",
                doc_type="background",
                relevance_score=0.85,
                metadata={'year': 2023, 'citations': 203}
            ),
            ContextDocument(
                title="Cancer Cell Metabolism and Treatment Resistance",
                content="Our previous work demonstrated that cancer cells alter their metabolic pathways to survive chemotherapy. We identified key enzymes that could serve as therapeutic targets, but the mechanisms of resistance adaptation remained unexplored.",
                doc_type="previous_research",
                relevance_score=0.88,
                metadata={'year': 2022, 'citations': 178}
            ),
            ContextDocument(
                title="Immunotherapy Breakthrough: T-cell Engineering",
                content="Earlier studies from our group showed that engineered T-cells can effectively target cancer cells. However, off-target effects and T-cell exhaustion limited clinical applications. New approaches are needed to overcome these challenges.",
                doc_type="previous_research",
                relevance_score=0.91,
                metadata={'year': 2021, 'citations': 245}
            )
        ]
        return contexts
    
    async def find_relevant_context(self, article: Dict, max_contexts: int = 3) -> List[ContextDocument]:
        """Find relevant context documents for an article"""
        article_field = article.get('field', '').lower()
        article_text = (article.get('title', '') + ' ' + article.get('abstract', '')).lower()
        
        # Simple keyword-based matching for demo
        field_keywords = {
            'neuroscience': ['neural', 'brain', 'synap', 'neuron'],
            'cancer research': ['cancer', 'tumor', 'oncology', 'chemotherapy'],
            'immunology': ['immune', 'antibody', 'vaccine', 't-cell']
        }
        
        relevant_contexts = []
        
        for context in self.context_db:
            relevance = 0.0
            
            content_lower = context.content.lower()
            
            # Field matching (skip when the field is empty: '' is a substring of every string)
            if article_field and article_field in content_lower:
                relevance += 0.4
            
            # Keyword matching
            if article_field in field_keywords:
                keywords = field_keywords[article_field]
                matching_keywords = sum(1 for kw in keywords if kw in content_lower)
                relevance += 0.3 * (matching_keywords / len(keywords))
            
            # Content similarity (simplified)
            common_words = set(article_text.split()) & set(content_lower.split())
            relevance += 0.3 * (len(common_words) / 100)  # Normalize
            
            if relevance > 0.2:  # Threshold for relevance
                # Append a scored copy so the shared database entries are not mutated
                relevant_contexts.append(ContextDocument(
                    title=context.title,
                    content=context.content,
                    doc_type=context.doc_type,
                    relevance_score=relevance,
                    metadata=context.metadata,
                ))
        
        # Sort by relevance and return top results
        relevant_contexts.sort(key=lambda x: x.relevance_score, reverse=True)
        return relevant_contexts[:max_contexts]

class ContextAwareScriptGenerator:
    """Generate scripts that incorporate relevant context"""
    
    def __init__(self, llm_provider, context_retriever):
        self.llm_provider = llm_provider
        self.context_retriever = context_retriever
    
    async def generate_context_aware_script(self, article: Dict) -> tuple[str, List[ContextDocument]]:
        """Generate a script with relevant context; returns (script, contexts used)"""
        
        # Retrieve relevant context
        contexts = await self.context_retriever.find_relevant_context(article)
        
        # Build context section for prompt
        context_text = ""
        if contexts:
            context_text = "\nRELEVANT CONTEXT:\n"
            for i, ctx in enumerate(contexts, 1):
                context_text += f"\n{i}. {ctx.title} ({ctx.doc_type})\n{ctx.content}\n"
        
        # Create enhanced prompt
        prompt = f"""
        You are an expert science communicator. Generate a podcast script for the CURRENT ARTICLE.
        Use the RELEVANT CONTEXT to provide background and highlight what makes this research significant.
        
        Explicitly reference how this work builds upon or extends previous findings.
        Show the progression of scientific understanding in this field.
        
        CURRENT ARTICLE:
        Title: {article.get('title', 'Unknown')}
        Field: {article.get('field', 'Unknown')}
        Abstract: {article.get('abstract', 'No abstract available')}
        
        {context_text}
        
        Create an engaging podcast script that:
        1. Contextualizes the research within the broader scientific landscape
        2. Explains how this work builds on previous studies
        3. Highlights novel contributions and breakthroughs
        4. Makes the science accessible to a general audience
        """
        
        script = await self.llm_provider.generate_response(prompt)
        
        return script, contexts

# Test context-aware script generation
context_retriever = MockContextRetriever()
context_generator = ContextAwareScriptGenerator(mock_llm_provider, context_retriever)

print("📚 Testing Multi-Modal RAG Context-Aware Generation...")
print("=" * 60)

# Test with articles from different fields
rag_test_articles = [
    article for article in mock_articles 
    if article['field'] in ['Neuroscience', 'Cancer Research', 'Immunology']
][:3]

rag_results = []

for i, article in enumerate(rag_test_articles):
    print(f"\n🧪 RAG Test {i+1}: {article['field']} Research")
    print(f"   Article: {article['title'][:60]}...")
    
    try:
        # Generate context-aware script
        script, used_contexts = await context_generator.generate_context_aware_script(article)
        
        print(f"✅ Context-aware script generated!")
        print(f"   Contexts found: {len(used_contexts)}")
        
        for j, ctx in enumerate(used_contexts, 1):
            print(f"      {j}. {ctx.title[:50]}... (relevance: {ctx.relevance_score:.2f})")
        
        # Analyze context usage
        context_references = sum(1 for ctx in used_contexts if ctx.title.lower()[:20] in script.lower())
        script_length = len(script)
        
        # Check for context integration indicators
        context_indicators = [
            'previous work', 'earlier studies', 'builds upon', 'extends',
            'foundation', 'background research', 'prior findings'
        ]
        indicator_count = sum(1 for indicator in context_indicators if indicator in script.lower())
        
        print(f"   Script length: {script_length:,} characters")
        print(f"   Context integration indicators: {indicator_count}")
        print(f"   Context usage: {'✅ Good' if indicator_count >= 2 else '⚠️ Limited'}")
        
        # Save context-aware script
        script_filename = f"rag_script_{i+1}_{article['field'].lower().replace(' ', '_')}.md"
        script_path = test_outputs_dir / script_filename
        
        with open(script_path, 'w', encoding='utf-8') as f:
            f.write(f"# Context-Aware Script: {article['title']}\n\n")
            f.write(f"**Field:** {article['field']}\n")
            f.write(f"**Contexts Used:** {len(used_contexts)}\n\n")
            
            if used_contexts:
                f.write("## Relevant Context Sources:\n")
                for ctx in used_contexts:
                    f.write(f"- {ctx.title} (relevance: {ctx.relevance_score:.2f})\n")
                f.write("\n")
            
            f.write("## Generated Script:\n\n")
            f.write(script)
        
        print(f"   💾 Saved to: {script_filename}")
        
        rag_results.append({
            'article_id': i+1,
            'field': article['field'],
            'title': article['title'],
            'contexts_found': len(used_contexts),
            'context_usage_indicators': indicator_count,
            'script_length': script_length,
            'contexts_used': [ctx.title for ctx in used_contexts],
            'status': 'success'
        })
        
    except Exception as e:
        print(f"❌ Context-aware generation failed: {e}")
        rag_results.append({
            'article_id': i+1,
            'field': article['field'],
            'title': article['title'],
            'error': str(e),
            'status': 'error'
        })

# RAG Summary
successful_runs = [r for r in rag_results if r['status'] == 'success']
successful_rag = len(successful_runs)
# Guard against empty lists so the summary never divides by zero or averages nothing
rag_success_rate = successful_rag / len(rag_results) if rag_results else 0.0
avg_contexts = np.mean([r['contexts_found'] for r in successful_runs]) if successful_runs else 0.0
avg_indicators = np.mean([r['context_usage_indicators'] for r in successful_runs]) if successful_runs else 0.0

print(f"\n📊 RAG Context-Aware Generation Summary:")
print(f"   Success Rate: {rag_success_rate:.1%} ({successful_rag}/{len(rag_results)})")
print(f"   Average Contexts Found: {avg_contexts:.1f}")
print(f"   Average Context Indicators: {avg_indicators:.1f}")
print(f"   Context Integration: {'✅ Excellent' if avg_indicators >= 3 else '👍 Good' if avg_indicators >= 2 else '⚠️ Needs improvement'}")

# Save RAG results
rag_results_path = test_outputs_dir / 'rag_context_results.json'
with open(rag_results_path, 'w') as f:
    json.dump(rag_results, f, indent=2)

print(f"\n💾 RAG results saved to: {rag_results_path}")

📚 Testing Multi-Modal RAG Context-Aware Generation...

🧪 RAG Test 1: Immunology Research
   Article: Advanced mechanisms of immune during disease progression...
✅ Context-aware script generated!
   Contexts found: 0
   Script length: 898 characters
   Context integration indicators: 0
   Context usage: ⚠️ Limited
   💾 Saved to: rag_script_1_immunology.md

🧪 RAG Test 2: Neuroscience Research
   Article: Comprehensive regulation of neural in development...
✅ Context-aware script generated!
   Contexts found: 0
   Script length: 898 characters
   Context integration indicators: 0
   Context usage: ⚠️ Limited
   💾 Saved to: rag_script_2_neuroscience.md

🧪 RAG Test 3: Cancer Research Research
   Article: Systematic mechanisms of oncology during disease progression...
✅ Context-aware script generated!
   Contexts found: 0
   Script length: 898 characters
   Context integration indicators: 0
   Context usage: ⚠️ Limited
   💾 Saved to: rag_script_3_cancer_research.md

📊 RAG Context-Aware Gener

## Section 6: Integration Testing

Test the complete pipeline integration with all new functionalities working together.

In [16]:
# Integrated Advanced Pipeline
class AdvancedPodcastPipeline:
    """Complete advanced pipeline integrating all new functionalities"""
    
    def __init__(self, embedding_provider, llm_provider):
        # Initialize components
        self.classifier = MockScientificFieldClassifier(embedding_provider)
        self.context_retriever = MockContextRetriever()
        self.structured_generator = StructuredScriptGenerator(llm_provider)
        self.context_generator = ContextAwareScriptGenerator(llm_provider, self.context_retriever)
        
        # Train classifier if embeddings data available
        if 'embeddings_dataset' in globals():
            self.classifier.train(embeddings_dataset)
            print("✅ Classifier trained and ready")
        
        self.processing_stats = {
            'articles_processed': 0,
            'classifications_made': 0,
            'contexts_retrieved': 0,
            'scripts_generated': 0,
            'errors': 0
        }
    
    async def process_article_complete(self, article: Dict) -> Dict:
        """Process an article through the complete advanced pipeline"""
        result = {
            'article_id': article.get('pmid', 'unknown'),
            'title': article.get('title', 'Unknown'),
            'original_field': article.get('field', 'Unknown'),
            'processing_steps': {},
            'outputs': {},
            'errors': [],
            'status': 'processing'
        }
        
        try:
            # Step 1: Automatic Field Classification
            print(f"🔍 Step 1: Classifying article field...")
            if self.classifier.is_trained:
                predicted_field, confidence, top_predictions = await self.classifier.predict(article)
                result['processing_steps']['classification'] = {
                    'predicted_field': predicted_field,
                    'confidence': confidence,
                    'top_predictions': top_predictions,
                    'matches_original': predicted_field == article.get('field', ''),
                    'status': 'success'
                }
                self.processing_stats['classifications_made'] += 1
                print(f"   Predicted field: {predicted_field} (confidence: {confidence:.2%})")
            else:
                result['processing_steps']['classification'] = {
                    'status': 'skipped',
                    'reason': 'classifier_not_trained'
                }
                predicted_field = article.get('field', 'Unknown')
            
            # Step 2: Context Retrieval (RAG)
            print(f"📚 Step 2: Retrieving relevant context...")
            relevant_contexts = await self.context_retriever.find_relevant_context(article, max_contexts=3)
            result['processing_steps']['context_retrieval'] = {
                'contexts_found': len(relevant_contexts),
                'context_titles': [ctx.title for ctx in relevant_contexts],
                'avg_relevance': np.mean([ctx.relevance_score for ctx in relevant_contexts]) if relevant_contexts else 0,
                'status': 'success'
            }
            self.processing_stats['contexts_retrieved'] += len(relevant_contexts)
            print(f"   Found {len(relevant_contexts)} relevant contexts")
            
            # Step 3: Generate Structured Script
            print(f"📝 Step 3: Generating structured script...")
            structured_script = await self.structured_generator.generate_structured_script(article)
            result['processing_steps']['structured_generation'] = {
                'title': structured_script.podcast_title,
                'sections_generated': 5,  # intro, methods, findings, implications, conclusion
                'findings_count': len(structured_script.key_findings),
                # Approximate: counts whitespace-separated tokens in the JSON dump, field names included
                'word_count': len(structured_script.model_dump_json().split()),
                'status': 'success'
            }
            result['outputs']['structured_script'] = structured_script.model_dump()
            print(f"   Generated structured script: '{structured_script.podcast_title}'")
            
            # Step 4: Generate Context-Aware Script
            print("🧠 Step 4: Generating context-aware script...")
            context_script, used_contexts = await self.context_generator.generate_context_aware_script(article)
            result['processing_steps']['context_aware_generation'] = {
                'contexts_used': len(used_contexts),
                'context_indicators': sum(1 for indicator in ['previous work', 'builds upon', 'earlier studies'] 
                                        if indicator in context_script.lower()),
                'script_length': len(context_script),
                'status': 'success'
            }
            result['outputs']['context_aware_script'] = context_script
            self.processing_stats['scripts_generated'] += 2  # structured + context-aware
            print(f"   Generated context-aware script ({len(context_script):,} chars)")
            
            # Step 5: Pipeline Integration Analysis
            print("🔗 Step 5: Pipeline integration analysis...")
            
            # Check consistency between classification and context.
            # any() over an empty list is already False, so no explicit guard is needed.
            classification_context_match = any(
                predicted_field.lower() in ctx.content.lower()
                for ctx in relevant_contexts
            )
            
            # Analyze script quality metrics
            structured_quality = {
                'has_all_sections': all([
                    structured_script.introduction,
                    structured_script.methods_summary,
                    structured_script.key_findings,
                    structured_script.implications_and_significance,
                    structured_script.conclusion
                ]),
                'findings_adequate': 2 <= len(structured_script.key_findings) <= 4,
                'length_appropriate': 500 <= len(structured_script.model_dump_json()) <= 2000
            }
            
            context_quality = {
                'has_context_integration': sum(1 for indicator in ['previous', 'builds', 'foundation'] 
                                             if indicator in context_script.lower()) >= 2,
                'appropriate_length': len(context_script) >= 800,
                'uses_retrieved_contexts': len(used_contexts) > 0
            }
            
            result['processing_steps']['integration_analysis'] = {
                'classification_context_consistency': classification_context_match,
                'structured_quality_score': sum(structured_quality.values()) / len(structured_quality),
                'context_quality_score': sum(context_quality.values()) / len(context_quality),
                'overall_pipeline_score': 0.0,  # Will calculate below
                'status': 'success'
            }
            
            # Calculate overall pipeline score as the unweighted mean of four component scores.
            # If classification was skipped there is no confidence value, so a neutral 0.5 is used.
            pipeline_scores = [
                result['processing_steps']['classification'].get('confidence', 0.5),
                result['processing_steps']['context_retrieval']['avg_relevance'],
                result['processing_steps']['integration_analysis']['structured_quality_score'],
                result['processing_steps']['integration_analysis']['context_quality_score']
            ]
            overall_score = float(np.mean(pipeline_scores))
            result['processing_steps']['integration_analysis']['overall_pipeline_score'] = overall_score
            
            result['status'] = 'completed'
            self.processing_stats['articles_processed'] += 1
            
            print(f"✅ Pipeline completed! Overall score: {overall_score:.2%}")
            
        except Exception as e:
            error_msg = f"Pipeline error: {e}"
            result['errors'].append(error_msg)
            result['status'] = 'error'
            self.processing_stats['errors'] += 1
            print(f"❌ {error_msg}")
        
        return result
    
    def get_pipeline_stats(self) -> Dict:
        """Get pipeline processing statistics."""
        stats = self.processing_stats
        # 'articles_processed' is incremented only when a run completes and 'errors'
        # only when one fails, so the number of attempts is their sum.
        processed = stats['articles_processed']
        attempts = processed + stats['errors']
        return {
            **stats,
            'success_rate': processed / max(1, attempts),
            'avg_contexts_per_article': stats['contexts_retrieved'] / max(1, processed),
            'avg_scripts_per_article': stats['scripts_generated'] / max(1, processed)
        }

# Initialize integrated pipeline
print("🚀 Initializing Advanced Podcast Pipeline...")
print("=" * 60)

advanced_pipeline = AdvancedPodcastPipeline(mock_embedding_provider, mock_llm_provider)

print("✅ Pipeline initialized successfully!")
print("   📊 Classifier: Trained and ready")
print("   📚 Context retriever: Mock database loaded")
print("   📝 Script generators: Structured + Context-aware")

🚀 Initializing Advanced Podcast Pipeline...
✅ MockScientificFieldClassifier initialized
🧠 Training scientific field classifier...
📊 Preparing training data from 50 samples...
   Features shape: (50, 768)
   Found 5 unique fields: ['Biochemistry', 'Cancer Research', 'Genetics', 'Immunology', 'Neuroscience']
   Training accuracy: 1.000
   Test accuracy: 0.400

📊 Classification Report:
                 precision    recall  f1-score   support

   Biochemistry       0.00      0.00      0.00         2
Cancer Research       0.50      1.00      0.67         2
       Genetics       1.00      0.50      0.67         2
     Immunology       0.00      0.00      0.00         2
   Neuroscience       0.25      0.50      0.33         2

       accuracy                           0.40        10
      macro avg       0.35      0.40      0.33        10
   weighted avg       0.35      0.40      0.33        10

✅ Classifier trained and ready
✅ Pipeline initialized successfully!
   📊 Classifier: Trained and r