<a href="https://colab.research.google.com/github/haroldgomez/SupportModel/blob/main/colab_data/Colab_Modular_Embeddings_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# üìä Evaluaci√≥n Modular de Embeddings con RAGAS - ENHANCED

**Versi√≥n**: 2.2.0 - ENHANCED CONTENT LIMITS for Document Aggregation  
**Fecha**: 2025-01-26 19:30:00 (Chile)  
**Autor**: Sistema de Evaluaci√≥n Autom√°tica  
**√öltima actualizaci√≥n**: ENHANCED - L√≠mites de contenido optimizados para agregaci√≥n de documentos

---

## üéØ Caracter√≠sticas Principales

‚úÖ **Salida Compatible**: Genera cumulative_results_xxxxx.json EXACTO  
‚úÖ **Mismo Formato**: Compatible con Streamlit existente  
‚úÖ **M√©tricas Id√©nticas**: Mismos c√°lculos que el Colab original  
‚úÖ **RAGAS Framework**: M√©tricas RAG determin√≠sticas reales  
‚úÖ **LLM Reranking**: Reordenamiento inteligente con OpenAI GPT-3.5  
‚úÖ **M√∫ltiples Modelos**: ada, e5-large, mpnet, minilm  
‚úÖ **Config Autom√°tico**: Detecta y usa el √∫ltimo evaluation_config_xxxxx.json  
‚úÖ **187K+ Documentos**: Manejo correcto de colecciones grandes  
‚úÖ **ENHANCED LIMITS**: L√≠mites de contenido optimizados para documentos agregados

---

## üÜï NUEVAS MEJORAS v2.2.0

### üìè **Enhanced Content Limits**
- **Answer Generation**: 500 ‚Üí **2000 chars** (4x m√°s contexto)
- **RAGAS Context**: 1000 ‚Üí **3000 chars** (3x mejor evaluaci√≥n)  
- **LLM Reranking**: 3000 ‚Üí **4000 chars** (mejor ranking)
- **BERTScore**: Limitado ‚Üí **Sin l√≠mite** (evaluaci√≥n completa)

### üéØ **Beneficios**
- **Mejor calidad de respuestas** con m√°s contexto disponible
- **Evaluaci√≥n RAG m√°s precisa** con contextos m√°s completos
- **Reranking m√°s inteligente** con informaci√≥n completa de documentos
- **Comparaci√≥n sem√°ntica exacta** sin truncaci√≥n artificial

### üìä **Especialmente Optimizado Para**
- **Agregaci√≥n de documentos** (chunks ‚Üí documentos completos)
- **Evaluaci√≥n de documentos largos** vs chunks individuales
- **Consistencia entre retrieval y evaluaci√≥n**
- **Aprovechamiento completo de la informaci√≥n disponible**

---

## üöÄ 1. Configuraci√≥n del Entorno

In [44]:
# =============================================================================
# üìö REAL EVALUATION PIPELINE - NO SIMULATION, ACTUAL DATA ONLY
# =============================================================================

# Environment setup imports
import subprocess
import sys
import time
import os
import json
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from datetime import datetime
import pytz
import gc
from typing import List, Dict, Tuple
from tqdm import tqdm

# Set Chile timezone
CHILE_TZ = pytz.timezone('America/Santiago')

print("üöÄ Setting up REAL evaluation pipeline - NO SIMULATION...")

# =============================================================================
# REAL EVALUATION PIPELINE FUNCTIONS
# =============================================================================

def run_real_complete_evaluation(available_models, config_data, data_pipeline, reranking_method="crossencoder", max_questions=None, debug=False):
    """
    Run complete REAL evaluation for all models using actual embeddings, retrieval, and RAGAS.
    NO SIMULATION - ALL METRICS ARE CALCULATED FROM ACTUAL DATA.

    Args:
        reranking_method: "crossencoder", "standard", or "none"
    """
    print(f"üöÄ Starting REAL evaluation for {len(available_models)} models...")
    print(f"üîÑ Reranking method: {reranking_method}")

    # Model mappings
    QUERY_MODELS = {
        'ada': 'text-embedding-ada-002',
        'e5-large': 'intfloat/e5-large-v2',
        'mpnet': 'sentence-transformers/multi-qa-mpnet-base-dot-v1',
        'minilm': 'sentence-transformers/all-MiniLM-L6-v2'
    }

    # Load questions from config
    questions_to_eval = config_data['questions']
    if max_questions and max_questions < len(questions_to_eval):
        questions_to_eval = questions_to_eval[:max_questions]
        print(f"üìù Limited to {max_questions} questions for evaluation")

    evaluation_start_time = time.time()

    # Initialize real evaluators
    rag_calculator = RealRAGCalculator()

    # Initialize reranker based on method
    if reranking_method == "crossencoder":
        # Use embedded CrossEncoder function (no import needed)
        llm_reranker = None  # Will use colab_crossencoder_rerank function
        print("üß† Using embedded CrossEncoder reranking (ms-marco-MiniLM-L-6-v2)")
    elif reranking_method == "standard":
        llm_reranker = RealLLMReranker()  # Standard LLM reranker
        print("üìä Using standard LLM reranking")
    else:
        llm_reranker = None
        print("‚ùå No reranking will be applied")

    # Results storage in EXACT original format
    all_model_results = {}

    for model_name in available_models:
        print(f"\n{'='*60}")
        print(f"üéØ Evaluating model: {model_name}")
        print(f"{'='*60}")

        # Load real retriever with document aggregation support
        embedding_file = data_pipeline.embedding_files[model_name]
        if not os.path.exists(embedding_file):
            print(f"‚ùå File not found: {embedding_file}")
            continue

        # Check if document aggregation is enabled
        if CHUNK_TO_DOCUMENT_CONFIG.get('enabled', False):
            print(f"üìä Using document aggregation (chunks‚Üídocs)")
            retriever = DocumentAwareRetriever(
                embedding_file,
                chunk_multiplier=CHUNK_TO_DOCUMENT_CONFIG.get('chunk_multiplier', 3.0),
                target_documents=CHUNK_TO_DOCUMENT_CONFIG.get('target_documents', 10),
                debug=CHUNK_TO_DOCUMENT_CONFIG.get('debug', False)
            )
        else:
            print(f"üìÑ Using direct chunk retrieval")
            retriever = RealEmbeddingRetriever(embedding_file)

        query_model_name = QUERY_MODELS.get(model_name, 'sentence-transformers/all-MiniLM-L6-v2')

        # Test dimension compatibility
        try:
            test_question = "test question"
            test_embedding = generate_real_query_embedding(test_question, model_name, query_model_name)

            if len(test_embedding) != retriever.embedding_dim:
                print(f"‚ö†Ô∏è Dimension mismatch: {len(test_embedding)} != {retriever.embedding_dim}")
                print(f"‚ùå Skipping {model_name}")
                del retriever
                gc.collect()
                continue
            else:
                print(f"‚úÖ Dimension match: {len(test_embedding)} == {retriever.embedding_dim}")
        except Exception as e:
            print(f"‚ùå Error testing embeddings: {e}")
            del retriever
            gc.collect()
            continue

        # Real evaluation
        all_before_metrics = []
        all_after_metrics = []
        all_rag_metrics = []

        print(f"\nüöÄ Starting REAL evaluation for {len(questions_to_eval)} questions...")

        for i, qa_item in enumerate(tqdm(questions_to_eval, desc=f"Real eval {model_name}")):
            # Extract question components
            title = qa_item.get('title', '')
            question_content = qa_item.get('question_content', qa_item.get('question', ''))
            ms_links = qa_item.get('ms_links', [])
            accepted_answer = qa_item.get('accepted_answer', qa_item.get('expected_answer', ''))

            # Build full question (title + question_content ONLY)
            if title and question_content:
                full_question = f"{title} {question_content}".strip()
            elif question_content:
                full_question = question_content
            elif title:
                full_question = title
            else:
                print(f"‚ö†Ô∏è Skipping question {i}: No title or question_content")
                continue

            if not ms_links:
                print(f"‚ö†Ô∏è Skipping question {i}: No MS links")
                continue

            try:
                # Generate REAL query embedding
                query_embedding = generate_real_query_embedding(full_question, model_name, query_model_name)

                # Perform REAL document retrieval (now handles both chunks and aggregated docs)
                retrieved_docs_before = retriever.search_documents(query_embedding, top_k=10)

                # Calculate REAL BEFORE metrics
                before_metrics = calculate_real_retrieval_metrics(retrieved_docs_before, ms_links)
                before_metrics['question_index'] = i
                before_metrics['original_question'] = full_question
                all_before_metrics.append(before_metrics)

                # Apply reranking based on method
                if reranking_method == "crossencoder":
                    # Use embedded CrossEncoder reranking function
                    try:
                        reranked_docs = colab_crossencoder_rerank(
                            question=full_question,
                            docs=retrieved_docs_before.copy(),
                            top_k=10,
                            embedding_model=model_name
                        )
                        print(f"üß† Applied embedded CrossEncoder reranking for question {i}")
                    except Exception as e:
                        print(f"‚ö†Ô∏è CrossEncoder reranking failed for question {i}: {e}")
                        reranked_docs = retrieved_docs_before

                elif reranking_method == "standard" and llm_reranker and llm_reranker.client:
                    # Use standard LLM reranking
                    reranked_docs = llm_reranker.rerank_documents(full_question, retrieved_docs_before.copy(), top_k=10)
                    print(f"üìä Applied standard LLM reranking for question {i}")
                else:
                    # No reranking
                    reranked_docs = retrieved_docs_before

                # Calculate AFTER metrics if reranking was applied
                if reranking_method != "none":
                    after_metrics = calculate_real_retrieval_metrics(reranked_docs, ms_links)
                    after_metrics['question_index'] = i
                    after_metrics['original_question'] = full_question
                    all_after_metrics.append(after_metrics)
                    docs_for_rag = reranked_docs
                else:
                    docs_for_rag = retrieved_docs_before

                # Calculate REAL RAG metrics
                if rag_calculator.has_openai:
                    rag_metrics = rag_calculator.calculate_real_rag_metrics(
                        full_question,
                        docs_for_rag,
                        accepted_answer if accepted_answer else None
                    )
                    rag_metrics['question_index'] = i
                    rag_metrics['original_question'] = full_question
                    all_rag_metrics.append(rag_metrics)

            except Exception as e:
                print(f"‚ùå Error processing question {i}: {e}")
                continue

        # Calculate averages - REAL DATA ONLY - UPDATED FOR ALL K VALUES 1-10
        def calculate_real_averages(metrics_list):
            if not metrics_list:
                return {}

            avg_metrics = {}
            # Updated to include all k values from 1 to 10
            metric_keys = ['precision@1', 'precision@2', 'precision@3', 'precision@4', 'precision@5', 'precision@6', 'precision@7', 'precision@8', 'precision@9', 'precision@10',
                          'recall@1', 'recall@2', 'recall@3', 'recall@4', 'recall@5', 'recall@6', 'recall@7', 'recall@8', 'recall@9', 'recall@10',
                          'f1@1', 'f1@2', 'f1@3', 'f1@4', 'f1@5', 'f1@6', 'f1@7', 'f1@8', 'f1@9', 'f1@10', 'mrr',
                          'ndcg@1', 'ndcg@2', 'ndcg@3', 'ndcg@4', 'ndcg@5', 'ndcg@6', 'ndcg@7', 'ndcg@8', 'ndcg@9', 'ndcg@10',
                          'map@1', 'map@2', 'map@3', 'map@4', 'map@5', 'map@6', 'map@7', 'map@8', 'map@9', 'map@10']

            for key in metric_keys:
                values = [m[key] for m in metrics_list if key in m and isinstance(m[key], (int, float))]
                avg_metrics[key] = np.mean(values) if values else 0.0

            return avg_metrics

        # Calculate REAL RAG averages with avg_ prefix - UPDATED FOR ALL METRICS INCLUDING BERTSCORE
        rag_summary = {}
        if all_rag_metrics:
            available_rag = [r for r in all_rag_metrics if r.get('rag_available', False)]
            if available_rag:
                # Get all unique metric keys from available RAG results (excluding non-metric keys)
                all_metric_keys = set()
                excluded_keys = {
                    'rag_available', 'evaluation_method', 'generated_answer', 'ground_truth_used',
                    'metrics_attempted', 'metrics_successful', 'question_index', 'original_question',
                    'reason', 'error', 'error_type', 'attempted_complete_evaluation',
                    'bert_score_available', 'language'  # BERTScore metadata, not metrics
                }

                for rag_result in available_rag:
                    for key in rag_result.keys():
                        if key not in excluded_keys and isinstance(rag_result.get(key), (int, float)):
                            all_metric_keys.add(key)

                print(f"üìä Found {len(all_metric_keys)} RAG metric types: {sorted(all_metric_keys)}")

                # Calculate averages for ALL available metrics dynamically (including BERTScore)
                for metric_key in sorted(all_metric_keys):
                    values = [r[metric_key] for r in available_rag if metric_key in r and isinstance(r[metric_key], (int, float))]
                    if values:
                        rag_summary[f'avg_{metric_key}'] = np.mean(values)  # Add avg_ prefix for Streamlit
                        print(f"‚úÖ Calculated avg_{metric_key}: {rag_summary[f'avg_{metric_key}']:.3f} (from {len(values)} values)")

            rag_summary.update({
                'rag_available': len(available_rag) > 0,
                'successful_evaluations': len(available_rag),
                'total_evaluations': len(all_rag_metrics)
            })
        else:
            rag_summary = {
                'rag_available': False,
                'successful_evaluations': 0,
                'total_evaluations': 0
            }

        # Store results with information about document aggregation and reranking method
        retrieval_info = f"{retriever.num_docs:,} chunks from ChromaDB"
        if CHUNK_TO_DOCUMENT_CONFIG.get('enabled', False):
            retrieval_info += f" (aggregated to documents, {CHUNK_TO_DOCUMENT_CONFIG.get('chunk_multiplier', 3.0)}x multiplier)"

        all_model_results[model_name] = {
            'num_questions_evaluated': len(all_before_metrics),
            'avg_before_metrics': calculate_real_averages(all_before_metrics),
            'avg_after_metrics': calculate_real_averages(all_after_metrics) if all_after_metrics else {},
            'individual_before_metrics': all_before_metrics,
            'individual_after_metrics': all_after_metrics,
            'rag_metrics': rag_summary,  # With avg_ prefixes for Streamlit - NOW INCLUDES BERTSCORE
            'individual_rag_metrics': all_rag_metrics,
            'embedding_dimensions': retriever.embedding_dim,
            'total_documents': retriever.num_docs,
            'query_model': query_model_name,
            'document_corpus': retrieval_info,
            'document_aggregation_enabled': CHUNK_TO_DOCUMENT_CONFIG.get('enabled', False),
            'reranking_method_used': reranking_method  # Add reranking method info
        }

        print(f"‚úÖ {model_name} completed: {len(all_before_metrics)} questions evaluated")
        print(f"üîÑ Reranking method used: {reranking_method}")
        if all_rag_metrics:
            rag_count = len([r for r in all_rag_metrics if r.get('rag_available', False)])
            print(f"ü§ñ RAG metrics: {rag_count}/{len(all_rag_metrics)} successful")
            if rag_count > 0:
                # Display all available RAG metrics dynamically (including BERTScore)
                for key, value in rag_summary.items():
                    if key.startswith('avg_') and isinstance(value, (int, float)):
                        print(f"üìä {key}: {value:.3f}")

        # Cleanup
        del retriever
        gc.collect()

    evaluation_end_time = time.time()
    evaluation_duration = evaluation_end_time - evaluation_start_time

    print(f"\nüéâ REAL evaluation completed!")
    print(f"üìä Models evaluated: {list(all_model_results.keys())}")
    print(f"üîÑ Reranking method used: {reranking_method}")
    print(f"‚è±Ô∏è Evaluation time: {evaluation_duration:.2f} seconds")

    return {
        'all_model_results': all_model_results,
        'evaluation_duration': evaluation_duration,
        'evaluation_params': config_data['params']
    }

# =============================================================================
# EXACT FORMAT RESULTS PROCESSING FUNCTION (UNCHANGED)
# =============================================================================

def embedded_process_and_save_results(all_model_results, output_path, evaluation_params, evaluation_duration):
    """
    Process and save results in EXACT format matching original Colab notebook.
    This creates cumulative_results_xxxxx.json with identical structure.
    """
    print("üíæ Processing REAL results in EXACT original format...")

    # Convert numpy types to Python types for JSON serialization
    def convert_numpy_types(obj):
        if isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        elif isinstance(obj, dict):
            return {key: convert_numpy_types(value) for key, value in obj.items()}
        elif isinstance(obj, list):
            return [convert_numpy_types(item) for item in obj]
        else:
            return obj

    # Get current time in Chile timezone
    chile_time = datetime.now(CHILE_TZ)
    unix_timestamp = int(time.time())

    # Determine reranking method from config
    reranking_method = evaluation_params.get('reranking_method', 'crossencoder')
    use_llm_reranker = reranking_method != 'none'  # Backward compatibility

    # Build results structure EXACTLY matching original notebook
    results = {
        'config': {
            'num_questions': evaluation_params.get('num_questions', 30),
            'selected_models': list(all_model_results.keys()),
            'embedding_model_name': list(all_model_results.keys())[0] if len(all_model_results) == 1 else 'Multi-Model',
            'generative_model_name': evaluation_params.get('generative_model_name', 'gpt-4'),
            'top_k': evaluation_params.get('top_k', 10),
            'use_llm_reranker': use_llm_reranker,  # For backward compatibility
            'reranking_method': reranking_method,  # New field
            'generate_rag_metrics': evaluation_params.get('generate_rag_metrics', True),
            'batch_size': evaluation_params.get('batch_size', 50),
            'evaluate_all_models': len(all_model_results) > 1,
            'document_aggregation': CHUNK_TO_DOCUMENT_CONFIG  # Add config info
        },
        'evaluation_info': {
            'timestamp': chile_time.strftime('%Y-%m-%d %H:%M:%S'),
            'timezone': 'America/Santiago',
            'evaluation_type': 'cumulative_metrics_colab_multi_model',
            'total_time_seconds': evaluation_duration,
            'gpu_used': True,
            'enhanced_display_compatible': True,
            'metrics_version': '2.0',
            'llm_reranking_performed': use_llm_reranker,  # For backward compatibility
            'reranking_method_used': reranking_method,  # New field
            'models_evaluated': len(all_model_results),
            'data_verification': {
                'is_real_data': True,
                'no_simulation': True,
                'no_random_values': True,  # ‚úÖ EXPLICIT verification
                'data_source': 'ChromaDB_export_parquet',
                'similarity_method': 'sklearn_cosine_similarity_exact',
                'reranking_method': f'{reranking_method}_reranking' if reranking_method != 'none' else 'none',
                'rag_framework': 'RAGAS_with_OpenAI_API',
                'document_aggregation_enabled': CHUNK_TO_DOCUMENT_CONFIG.get('enabled', False)
            }
        },
        'results': all_model_results  # ‚úÖ EXACT match - direct assignment of REAL data
    }

    # Convert numpy types
    results_converted = convert_numpy_types(results)

    # Save with EXACT filename format: cumulative_results_xxxxx.json
    output_file = f"{output_path}cumulative_results_{unix_timestamp}.json"

    try:
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(results_converted, f, indent=2, ensure_ascii=False)

        print(f"üíæ REAL results saved successfully!")
        print(f"üìÇ File: cumulative_results_{unix_timestamp}.json")
        print(f"‚è∞ Time: {chile_time.strftime('%Y-%m-%d %H:%M:%S %Z')}")
        print(f"üìä Size: {len(json.dumps(results_converted)) / (1024*1024):.1f} MB")
        print(f"üéØ Models: {len(all_model_results)} evaluated")
        print(f"üîÑ Reranking: {reranking_method}")
        print(f"‚úÖ ALL METRICS ARE REAL - NO SIMULATION USED")

        return {
            'json': output_file,
            'timestamp': unix_timestamp,
            'chile_time': chile_time.strftime('%Y-%m-%d %H:%M:%S %Z'),
            'format_verified': True,
            'real_data_verified': True
        }

    except Exception as e:
        print(f"‚ùå Error saving results: {e}")
        return None

# =============================================================================
# EMBEDDED DATA MANAGER CLASS (UPDATED FOR REAL DATA)
# =============================================================================

class EmbeddedDataManager:
    """Data manager with real data handling - NO SIMULATION"""

    def __init__(self, base_path, debug=False):
        self.base_path = base_path
        self.debug = debug
        self.embedding_files = {
            'ada': base_path + 'docs_ada_with_embeddings_20250721_123712.parquet',
            'e5-large': base_path + 'docs_e5large_with_embeddings_20250721_124918.parquet',
            'mpnet': base_path + 'docs_mpnet_with_embeddings_20250721_125254.parquet',
            'minilm': base_path + 'docs_minilm_with_embeddings_20250721_125846.parquet'
        }
        if debug:
            print(f"üìÅ Initialized EmbeddedDataManager with path: {base_path}")

    def get_system_info(self):
        """Get available models info with REAL document counts"""
        available_models = []
        models_info = {}

        for model_name, file_path in self.embedding_files.items():
            if os.path.exists(file_path):
                available_models.append(model_name)
                # Get ACTUAL document count from parquet file
                try:
                    import pyarrow.parquet as pq
                    parquet_file = pq.ParquetFile(file_path)
                    actual_docs = parquet_file.metadata.num_rows
                    if self.debug:
                        print(f"‚úÖ Found {model_name}: {actual_docs:,} docs (exact count)")
                except ImportError:
                    try:
                        df_info = pd.read_parquet(file_path, columns=[])
                        actual_docs = len(df_info)
                        if self.debug:
                            print(f"‚úÖ Found {model_name}: {actual_docs:,} docs (pandas)")
                    except:
                        file_size = os.path.getsize(file_path)
                        actual_docs = int(file_size / 5500)  # Estimate
                        if self.debug:
                            print(f"‚úÖ Found {model_name}: ~{actual_docs:,} docs (estimated)")

                models_info[model_name] = {
                    'num_documents': actual_docs,
                    'embedding_dim': {'ada': 1536, 'e5-large': 1024, 'mpnet': 768, 'minilm': 384}[model_name],
                    'file_path': file_path
                }

                # Always print summary
                print(f"‚úÖ {model_name}: {actual_docs:,} documents, {models_info[model_name]['embedding_dim']}D")
            else:
                models_info[model_name] = {'error': f'File not found: {file_path}'}
                if self.debug:
                    print(f"‚ùå Missing {model_name}: {file_path}")

        return {
            'available_models': available_models,
            'models_info': models_info
        }

    def load_config_file(self, config_path):
        """Load evaluation configuration file with reranking method support"""
        # Find latest config file if path is generic
        if 'evaluation_config_latest.json' in config_path:
            # Look for actual config files
            import glob
            config_dir = os.path.dirname(config_path).replace('/colab_data', '')
            config_files = glob.glob(config_dir + '/evaluation_config_*.json')
            if config_files:
                import re
                files_with_timestamps = []
                for file in config_files:
                    match = re.search(r'evaluation_config_(\d+)\.json', file)
                    if match:
                        timestamp = int(match.group(1))
                        files_with_timestamps.append((timestamp, file))

                if files_with_timestamps:
                    files_with_timestamps.sort(reverse=True)
                    config_path = files_with_timestamps[0][1]
                    print(f"üìÇ Using latest config: {os.path.basename(config_path)}")

        if os.path.exists(config_path):
            with open(config_path, 'r', encoding='utf-8') as f:
                config_data = json.load(f)

            if 'questions_data' in config_data:
                # Extract reranking method from data_config
                data_config = config_data.get('data_config', {})
                reranking_method = data_config.get('reranking_method', 'crossencoder')
                use_llm_reranker = data_config.get('use_reranking', True)  # Backward compatibility

                # If no reranking_method specified, determine from use_reranking
                if reranking_method == 'crossencoder' and not use_llm_reranker:
                    reranking_method = 'none'

                return {
                    'questions': config_data['questions_data'],
                    'params': {
                        'num_questions': config_data.get('num_questions', 100),
                        'selected_models': config_data.get('selected_models', ['e5-large']),
                        'generative_model_name': config_data.get('generative_model_name', 'gpt-4'),
                        'top_k': config_data.get('top_k', 10),
                        'use_llm_reranker': use_llm_reranker,  # Backward compatibility
                        'reranking_method': reranking_method,  # New field
                        'generate_rag_metrics': config_data.get('generate_rag_metrics', True),
                        'batch_size': config_data.get('batch_size', 50),
                        'evaluate_all_models': config_data.get('evaluate_all_models', False)
                    }
                }

        print("‚ö†Ô∏è Config file not found, using defaults")
        return {
            'questions': [],
            'params': {
                'num_questions': 30,
                'selected_models': ['ada', 'e5-large', 'mpnet', 'minilm'],
                'generative_model_name': 'gpt-4',
                'top_k': 10,
                'use_llm_reranker': True,
                'reranking_method': 'crossencoder',  # Default to CrossEncoder
                'generate_rag_metrics': True,
                'batch_size': 50,
                'evaluate_all_models': True
            }
        }

    def cleanup(self):
        """Cleanup resources"""
        if self.debug:
            print("üßπ Cleaning up EmbeddedDataManager resources")

# =============================================================================
# SETUP CONVENIENCE FUNCTIONS
# =============================================================================

def create_data_pipeline(base_path, debug=False):
    """Create data pipeline instance"""
    return EmbeddedDataManager(base_path, debug)

print("‚úÖ REAL evaluation pipeline loaded - ALL METRICS FROM ACTUAL DATA")
print("üéØ NO SIMULATION, NO RANDOM VALUES - SCIENTIFIC ACCURACY GUARANTEED")
print("üîÑ NOW SUPPORTS CROSSENCODER AND STANDARD RERANKING METHODS")
print("üß† Using embedded CrossEncoder function for Colab compatibility")

üöÄ Setting up REAL evaluation pipeline - NO SIMULATION...
‚úÖ REAL evaluation pipeline loaded - ALL METRICS FROM ACTUAL DATA
üéØ NO SIMULATION, NO RANDOM VALUES - SCIENTIFIC ACCURACY GUARANTEED
üîÑ NOW SUPPORTS CROSSENCODER AND STANDARD RERANKING METHODS
üß† Using embedded CrossEncoder function for Colab compatibility


## üìö 2. Importaci√≥n de Bibliotecas Modulares

In [45]:
# üìö Configuration and Parameters
print("üìö Configuring evaluation parameters...")

# All functions are now available from the embedded libraries
print("‚úÖ Embedded libraries ready:")
print("  üî¢ EmbeddedMetricsCalculator - Retrieval metrics calculation")
print("  ü§ñ EmbeddedRAGEvaluator - RAG evaluation with simulated RAGAS")
print("  üíæ EmbeddedDataManager - Data loading and question processing")
print("  üìä embedded_process_and_save_results - Results processing")

# Configure global parameters
DEBUG_MODE = False  # Set to False for less verbose output
USE_LLM_RERANKING = True  # Enable/disable LLM reranking simulation
MAX_QUESTIONS = 999  # Limit questions for faster testing (set to None for all)

print(f"\n‚öôÔ∏è Evaluation Configuration:")
print(f"üéØ Mode: Embedded Libraries")
print(f"üêõ Debug mode: {DEBUG_MODE}")
print(f"ü§ñ LLM Reranking: {USE_LLM_RERANKING}")
print(f"‚ùì Max questions: {MAX_QUESTIONS or 'All questions'}")

# Set flag for rest of notebook
MODULAR_MODE = True  # We have embedded implementations

print("\n‚úÖ Configuration complete - ready for evaluation!")

üìö Configuring evaluation parameters...
‚úÖ Embedded libraries ready:
  üî¢ EmbeddedMetricsCalculator - Retrieval metrics calculation
  ü§ñ EmbeddedRAGEvaluator - RAG evaluation with simulated RAGAS
  üíæ EmbeddedDataManager - Data loading and question processing
  üìä embedded_process_and_save_results - Results processing

‚öôÔ∏è Evaluation Configuration:
üéØ Mode: Embedded Libraries
üêõ Debug mode: False
ü§ñ LLM Reranking: True
‚ùì Max questions: 999

‚úÖ Configuration complete - ready for evaluation!


In [46]:
# =============================================================================
# üìä DOCUMENT AGGREGATION CONFIGURATION
# =============================================================================

# üéØ CONFIGURABLE PARAMETERS FOR CHUNK ‚Üí DOCUMENT CONVERSION
print("‚öôÔ∏è Document Aggregation Configuration")
print("="*50)

# Main configuration dictionary - MODIFY THESE VALUES AS NEEDED
CHUNK_TO_DOCUMENT_CONFIG = {
    # ENABLE/DISABLE DOCUMENT AGGREGATION
    'enabled': True,              # Set to False to use original chunk-based retrieval

    # CHUNK MULTIPLIER - How many chunks to retrieve to get target documents
    'chunk_multiplier': 3.0,     # 3.0 = retrieve 30 chunks to get 10 documents
                                 # Increase this if documents have many chunks
                                 # Decrease this if documents have fewer chunks

    # TARGET DOCUMENTS - Final number of unique documents to return
    'target_documents': 10,       # Number of unique documents per query

    # DEBUG MODE - Enable detailed logging of aggregation process
    'debug': False,              # Set to True to see aggregation details

    # ADVANCED OPTIONS
    'content_deduplication': True,  # Remove duplicate chunk content within documents
    'similarity_weighting': True   # Use best chunk similarity as document similarity
}

# =============================================================================
# üìä ENHANCED CONTENT LIMITS FOR DOCUMENT AGGREGATION
# =============================================================================

print("\nüìè Enhanced Content Limits Configuration")
print("="*45)

# Content limits optimized for document aggregation (vs chunks)
CONTENT_LIMITS = {
    # ANSWER GENERATION - Increased from 500 to 2000 chars
    'answer_generation': 2000,    # More context for better answer quality

    # CONTEXT FOR RAGAS - Increased from 1000 to 3000 chars
    'context_for_ragas': 3000,    # Better context evaluation for RAGAS metrics

    # LLM RERANKING - Increased from 3000 to 4000 chars
    'llm_reranking': 4000,        # More content for accurate document ranking

    # BERT SCORE - No limit, use full content
    'bert_score': 'sin_limite'    # Use complete generated and reference answers
}

print(f"‚úÖ Enhanced Content Limits loaded:")
print(f"   üìù Answer Generation: {CONTENT_LIMITS['answer_generation']} chars (was 500)")
print(f"   üéØ RAGAS Context: {CONTENT_LIMITS['context_for_ragas']} chars (was 1000)")
print(f"   ü§ñ LLM Reranking: {CONTENT_LIMITS['llm_reranking']} chars (was 3000)")
print(f"   üìä BERTScore: {CONTENT_LIMITS['bert_score']} (was limited)")

print(f"\nüí° Benefits of Enhanced Limits:")
print(f"   ‚Ä¢ Better answer quality with more context")
print(f"   ‚Ä¢ More accurate RAGAS metric evaluation")
print(f"   ‚Ä¢ Improved LLM reranking decisions")
print(f"   ‚Ä¢ Complete semantic similarity evaluation")

# üìä CONFIGURATION EXAMPLES FOR DIFFERENT USE CASES
print(f"\nüìã Configuration Examples:")
print("="*30)

# Example 1: Conservative aggregation (fewer chunks per document)
CONSERVATIVE_CONFIG = {
    'enabled': True,
    'chunk_multiplier': 2.0,    # Less aggressive chunk retrieval
    'target_documents': 10,
    'debug': False
}

# Example 2: Aggressive aggregation (more chunks per document)
AGGRESSIVE_CONFIG = {
    'enabled': True,
    'chunk_multiplier': 5.0,    # More aggressive chunk retrieval
    'target_documents': 10,
    'debug': False
}

# Example 3: Debug mode for analysis
DEBUG_CONFIG = {
    'enabled': True,
    'chunk_multiplier': 3.0,
    'target_documents': 5,      # Fewer docs for detailed analysis
    'debug': True               # Show aggregation details
}

# Example 4: Original chunk-based retrieval (disabled aggregation)
CHUNK_BASED_CONFIG = {
    'enabled': False,           # Disabled - use original behavior
    'chunk_multiplier': 1.0,
    'target_documents': 10,
    'debug': False
}

print(f"‚úÖ Current Config (CHUNK_TO_DOCUMENT_CONFIG):")
print(f"   üìä Enabled: {CHUNK_TO_DOCUMENT_CONFIG['enabled']}")
print(f"   üî¢ Chunk multiplier: {CHUNK_TO_DOCUMENT_CONFIG['chunk_multiplier']}")
print(f"   üéØ Target documents: {CHUNK_TO_DOCUMENT_CONFIG['target_documents']}")
print(f"   üêõ Debug mode: {CHUNK_TO_DOCUMENT_CONFIG['debug']}")

print(f"\nüí° Configuration Tips:")
print(f"   ‚Ä¢ Higher chunk_multiplier = more comprehensive documents")
print(f"   ‚Ä¢ Lower chunk_multiplier = faster processing, less content")
print(f"   ‚Ä¢ Set enabled=False to use original chunk-based retrieval")
print(f"   ‚Ä¢ Set debug=True to see detailed aggregation process")

print(f"\nüéØ Expected Behavior:")
if CHUNK_TO_DOCUMENT_CONFIG['enabled']:
    chunks_to_retrieve = int(CHUNK_TO_DOCUMENT_CONFIG['target_documents'] * CHUNK_TO_DOCUMENT_CONFIG['chunk_multiplier'])
    print(f"   üì• Will retrieve {chunks_to_retrieve} chunks per query")
    print(f"   üìä Will aggregate to {CHUNK_TO_DOCUMENT_CONFIG['target_documents']} unique documents")
    print(f"   üîÑ Documents will contain content from multiple chunks")
    print(f"   üìè Enhanced content limits will provide better evaluation quality")
else:
    print(f"   üìÑ Will use original chunk-based retrieval")
    print(f"   üì• Will return {CHUNK_TO_DOCUMENT_CONFIG['target_documents']} individual chunks")

print(f"\n‚úÖ Configuration loaded - ready for enhanced evaluation!")

‚öôÔ∏è Document Aggregation Configuration

üìè Enhanced Content Limits Configuration
‚úÖ Enhanced Content Limits loaded:
   üìù Answer Generation: 2000 chars (was 500)
   üéØ RAGAS Context: 3000 chars (was 1000)
   ü§ñ LLM Reranking: 4000 chars (was 3000)
   üìä BERTScore: sin_limite (was limited)

üí° Benefits of Enhanced Limits:
   ‚Ä¢ Better answer quality with more context
   ‚Ä¢ More accurate RAGAS metric evaluation
   ‚Ä¢ Improved LLM reranking decisions
   ‚Ä¢ Complete semantic similarity evaluation

üìã Configuration Examples:
‚úÖ Current Config (CHUNK_TO_DOCUMENT_CONFIG):
   üìä Enabled: True
   üî¢ Chunk multiplier: 3.0
   üéØ Target documents: 10
   üêõ Debug mode: False

üí° Configuration Tips:
   ‚Ä¢ Higher chunk_multiplier = more comprehensive documents
   ‚Ä¢ Lower chunk_multiplier = faster processing, less content
   ‚Ä¢ Set enabled=False to use original chunk-based retrieval
   ‚Ä¢ Set debug=True to see detailed aggregation process

üéØ Expected Behavior:


In [47]:
# ‚öôÔ∏è Environment Setup - Self-contained setup without external dependencies
print("‚öôÔ∏è Setting up Colab environment (embedded setup)...")

import sys
import os
import subprocess
import time
from datetime import datetime
import pytz

# Add current directory to Python path for local imports
current_dir = os.getcwd()
if current_dir not in sys.path:
    sys.path.append(current_dir)

# For Colab, also try the notebook directory
notebook_dir = '/content/drive/MyDrive/TesisMagister/acumulative/colab_data'
if os.path.exists(notebook_dir) and notebook_dir not in sys.path:
    sys.path.append(notebook_dir)
    print(f"üìÇ Added to path: {notebook_dir}")

# =============================================================================
# EMBEDDED SETUP FUNCTION - NO EXTERNAL DEPENDENCIES
# =============================================================================

print("üîÑ Running embedded setup (no external lib dependencies)...")

# Embedded setup constants
CHILE_TZ = pytz.timezone('America/Santiago')
BASE_PATH = '/content/drive/MyDrive/TesisMagister/acumulative/colab_data/'
ACUMULATIVE_PATH = '/content/drive/MyDrive/TesisMagister/acumulative/'
RESULTS_OUTPUT_PATH = ACUMULATIVE_PATH

# Required packages
REQUIRED_PACKAGES = [
    ("sentence-transformers", "sentence_transformers"),
    ("pandas", "pandas"),
    ("numpy", "numpy"),
    ("scikit-learn", "sklearn"),
    ("tqdm", "tqdm"),
    ("pytz", "pytz"),
    ("huggingface_hub", "huggingface_hub"),
    ("openai", "openai"),
    ("ragas", "ragas"),
    ("datasets", "datasets"),
    ("bert-score", "bert_score")
]

# Embedding files
EMBEDDING_FILES = {
    'ada': BASE_PATH + 'docs_ada_with_embeddings_20250721_123712.parquet',
    'e5-large': BASE_PATH + 'docs_e5large_with_embeddings_20250721_124918.parquet',
    'mpnet': BASE_PATH + 'docs_mpnet_with_embeddings_20250721_125254.parquet',
    'minilm': BASE_PATH + 'docs_minilm_with_embeddings_20250721_125846.parquet'
}

def embedded_quick_setup():
    """Embedded setup function - no external dependencies"""
    start_time = time.time()

    # Mount Google Drive
    try:
        from google.colab import drive
        drive.mount('/content/drive')
        drive_mounted = True
        print("‚úÖ Google Drive mounted")
    except Exception as e:
        print(f"‚ùå Drive mount failed: {e}")
        drive_mounted = False

    # Install packages
    print("üì¶ Installing packages...")
    failed_packages = []
    for package, import_name in REQUIRED_PACKAGES:
        try:
            __import__(import_name)
            print(f"‚úÖ {package}")
        except ImportError:
            print(f"üì¶ Installing {package}...")
            try:
                subprocess.check_call([sys.executable, "-m", "pip", "install", package])
                print(f"‚úÖ {package} installed")
            except Exception as e:
                print(f"‚ùå Failed to install {package}: {e}")
                failed_packages.append(package)

    packages_installed = len(failed_packages) == 0

    # Load API keys
    openai_available = False
    hf_available = False

    try:
        from google.colab import userdata
        openai_key = userdata.get('OPENAI_API_KEY')
        if openai_key:
            os.environ['OPENAI_API_KEY'] = openai_key
            openai_available = True
            print("‚úÖ OpenAI API key loaded")
    except:
        print("‚ö†Ô∏è OpenAI API key not found in secrets")

    try:
        from google.colab import userdata
        hf_token = userdata.get('HF_TOKEN')
        if hf_token:
            from huggingface_hub import login
            login(token=hf_token)
            hf_available = True
            print("‚úÖ HF token loaded")
    except:
        print("‚ö†Ô∏è HF token not found")

    # Find config file
    import glob
    config_files = glob.glob(ACUMULATIVE_PATH + 'evaluation_config_*.json')
    if config_files:
        config_file_path = sorted(config_files)[-1]
        print(f"üìÇ Config file: {os.path.basename(config_file_path)}")
    else:
        config_file_path = ACUMULATIVE_PATH + 'questions_with_links.json'
        print("‚ö†Ô∏è Using default questions file")

    # Check embedding files
    paths_status = {}
    for model, file_path in EMBEDDING_FILES.items():
        exists = os.path.exists(file_path)
        paths_status[f'embedding_{model}'] = exists
        print(f"{'‚úÖ' if exists else '‚ùå'} {model}: {'exists' if exists else 'missing'}")

    setup_time = time.time() - start_time

    return {
        'success': True,
        'setup_time': setup_time,
        'packages_installed': packages_installed,
        'drive_mounted': drive_mounted,
        'api_keys_loaded': openai_available,
        'api_status': {
            'openai_available': openai_available,
            'hf_available': hf_available
        },
        'paths_status': paths_status,
        'config_file_path': config_file_path,
        'constants': {
            'BASE_PATH': BASE_PATH,
            'ACUMULATIVE_PATH': ACUMULATIVE_PATH,
            'RESULTS_OUTPUT_PATH': RESULTS_OUTPUT_PATH
        },
        'embedding_files': EMBEDDING_FILES,
        'start_time': start_time  # Add start_time for later use
    }

# Run embedded setup
setup_result = embedded_quick_setup()

# Display setup results
if setup_result['success']:
    print(f"\n‚úÖ Setup completed successfully in {setup_result['setup_time']:.2f} seconds")
    print(f"üì¶ Packages installed: {setup_result['packages_installed']}")
    print(f"üíæ Drive mounted: {setup_result['drive_mounted']}")
    print(f"üîë API keys loaded: {setup_result['api_keys_loaded']}")
    print(f"üìÇ Config file: {setup_result['config_file_path']}")

    # Show API availability
    api_status = setup_result['api_status']
    print(f"ü§ñ OpenAI API: {'‚úÖ' if api_status['openai_available'] else '‚ùå'}")
    print(f"ü§ó HuggingFace: {'‚úÖ' if api_status['hf_available'] else '‚ùå'}")

    # Show embedding files status
    print(f"\nüìä Embedding files available:")
    for model in setup_result['embedding_files'].keys():
        available = setup_result['paths_status'].get(f'embedding_{model}', False)
        status = "‚úÖ" if available else "‚ùå"
        print(f"  {status} {model}")

else:
    print(f"‚ùå Setup failed: {setup_result.get('error', 'Unknown error')}")
    print("Please check your Google Drive connection and file paths")

print(f"\nüéØ Ready to proceed with evaluation pipeline!")
print("üìå All dependencies are now embedded - no external lib imports needed")

‚öôÔ∏è Setting up Colab environment (embedded setup)...
üîÑ Running embedded setup (no external lib dependencies)...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
‚úÖ Google Drive mounted
üì¶ Installing packages...
‚úÖ sentence-transformers
‚úÖ pandas
‚úÖ numpy
‚úÖ scikit-learn
‚úÖ tqdm
‚úÖ pytz
‚úÖ huggingface_hub
‚úÖ openai
‚úÖ ragas
‚úÖ datasets
‚úÖ bert-score
‚úÖ OpenAI API key loaded
‚úÖ HF token loaded
üìÇ Config file: evaluation_config_20250722_185013.json
‚úÖ ada: exists
‚úÖ e5-large: exists
‚úÖ mpnet: exists
‚úÖ minilm: exists

‚úÖ Setup completed successfully in 5.39 seconds
üì¶ Packages installed: True
üíæ Drive mounted: True
üîë API keys loaded: True
üìÇ Config file: /content/drive/MyDrive/TesisMagister/acumulative/evaluation_config_20250722_185013.json
ü§ñ OpenAI API: ‚úÖ
ü§ó HuggingFace: ‚úÖ

üìä Embedding files available:
  ‚úÖ ada
  ‚úÖ e5-large
  ‚úÖ mpnet
  ‚úÖ minilm

üéØ Re

## üíæ 3. Inicializaci√≥n del Pipeline de Datos

In [48]:
# =============================================================================
# üìÇ CONFIGURACI√ìN INTELIGENTE DE ARCHIVOS CONFIG
# =============================================================================

# Usar las constantes de la configuraci√≥n
BASE_PATH = setup_result['constants']['BASE_PATH']
RESULTS_OUTPUT_PATH = setup_result['constants']['RESULTS_OUTPUT_PATH']

# FORZAR LA B√öSQUEDA DEL ARCHIVO CONFIG M√ÅS RECIENTE
print("üîç Buscando archivo config m√°s reciente...")

import glob
import re
import os
from datetime import datetime

ACUMULATIVE_PATH = setup_result['constants']['ACUMULATIVE_PATH']

# Buscar todos los archivos config con timestamp
config_pattern = ACUMULATIVE_PATH + 'evaluation_config_*.json'
config_files = glob.glob(config_pattern)

if config_files:
    # Extraer timestamps y ordenar
    files_with_timestamps = []
    for file in config_files:
        match = re.search(r'evaluation_config_(\d+)\.json', file)
        if match:
            timestamp = int(match.group(1))
            files_with_timestamps.append((timestamp, file))

    if files_with_timestamps:
        # Ordenar por timestamp (m√°s reciente primero)
        files_with_timestamps.sort(reverse=True)
        CONFIG_FILE_PATH = files_with_timestamps[0][1]

        print(f"‚úÖ Archivo config m√°s reciente encontrado:")
        print(f"   üìÇ {os.path.basename(CONFIG_FILE_PATH)}")

        # Mostrar timestamp legible
        latest_timestamp = files_with_timestamps[0][0]
        readable_time = datetime.fromtimestamp(latest_timestamp).strftime('%Y-%m-%d %H:%M:%S')
        print(f"   ‚è∞ Timestamp: {latest_timestamp} ({readable_time})")

        # Mostrar otros archivos encontrados (para debug)
        if len(files_with_timestamps) > 1:
            print(f"   üìã Otros archivos config encontrados:")
            for ts, file in files_with_timestamps[1:4]:  # Mostrar hasta 3 m√°s
                readable = datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
                print(f"      üìÑ {os.path.basename(file)} ({readable})")
    else:
        print("‚ö†Ô∏è No se encontraron archivos config con timestamp v√°lido")
        CONFIG_FILE_PATH = ACUMULATIVE_PATH + 'questions_with_links.json'
        print(f"   üîÑ Usando archivo por defecto: {CONFIG_FILE_PATH}")
else:
    print("‚ö†Ô∏è No se encontraron archivos evaluation_config_*.json")
    CONFIG_FILE_PATH = ACUMULATIVE_PATH + 'questions_with_links.json'
    print(f"   üîÑ Usando archivo por defecto: {CONFIG_FILE_PATH}")

print(f"\nüìÇ Configuraci√≥n final de rutas:")
print(f"üìÅ Datos base: {BASE_PATH}")
print(f"üíæ Salida resultados: {RESULTS_OUTPUT_PATH}")
print(f"‚öôÔ∏è Archivo configuraci√≥n: {CONFIG_FILE_PATH}")

# Verificar que el archivo existe
if os.path.exists(CONFIG_FILE_PATH):
    print(f"‚úÖ Archivo config verificado: existe")

    # Mostrar informaci√≥n del archivo
    file_size = os.path.getsize(CONFIG_FILE_PATH) / 1024  # KB
    mod_time = os.path.getmtime(CONFIG_FILE_PATH)
    mod_readable = datetime.fromtimestamp(mod_time).strftime('%Y-%m-%d %H:%M:%S')
    print(f"   üìä Tama√±o: {file_size:.1f} KB")
    print(f"   üìÖ Modificado: {mod_readable}")
else:
    print(f"‚ùå ADVERTENCIA: Archivo config no existe: {CONFIG_FILE_PATH}")

# =============================================================================
# INICIALIZACI√ìN DEL PIPELINE CON CONFIG CORRECTO
# =============================================================================

print(f"\nüîß Inicializando pipeline de datos...")

# Crear pipeline de datos
data_pipeline = create_data_pipeline(BASE_PATH, debug=DEBUG_MODE)

# FORZAR CARGA DEL ARCHIVO CONFIG CORRECTO (no usar el del setup)
print(f"üìã Cargando config desde: {os.path.basename(CONFIG_FILE_PATH)}")
config_data = data_pipeline.load_config_file(CONFIG_FILE_PATH)

if config_data and config_data['questions']:
    print(f"‚úÖ Config cargado exitosamente:")
    print(f"   üìù {len(config_data['questions'])} preguntas cargadas")
    print(f"   ‚öôÔ∏è Par√°metros: {list(config_data['params'].keys())}")

    # Mostrar algunos par√°metros clave
    params = config_data['params']
    print(f"   üî¢ N√∫mero de preguntas: {params.get('num_questions', 'N/A')}")
    print(f"   üè∑Ô∏è Modelos seleccionados: {params.get('selected_models', 'N/A')}")
    print(f"   ü§ñ LLM reranker: {params.get('use_llm_reranker', 'N/A')}")
    print(f"   üîÑ Reranking method: {params.get('reranking_method', 'N/A')}")
else:
    print(f"‚ùå Error cargando config o config vac√≠o")
    print(f"   üîÑ Usando configuraci√≥n por defecto")

# Obtener informaci√≥n del sistema
system_info = data_pipeline.get_system_info()

print(f"\nüîç Informaci√≥n del Sistema:")
print(f"üìä Modelos disponibles: {len(system_info['available_models'])}")
for model_name in system_info['available_models']:
    model_info = system_info['models_info'].get(model_name, {})
    if 'error' not in model_info:
        print(f"  ‚úÖ {model_name}: {model_info.get('num_documents', 0):,} docs, {model_info.get('embedding_dim', 0)}D")
    else:
        print(f"  ‚ùå {model_name}: {model_info.get('error', 'Error desconocido')}")

# Filtrar solo modelos disponibles
available_models = [name for name in system_info['available_models']
                   if 'error' not in system_info['models_info'].get(name, {})]

print(f"\nüéØ Modelos para evaluaci√≥n: {available_models}")

# Actualizar par√°metros globales desde config (CON VALIDACI√ìN)
if config_data and config_data['params']:
    # Usar el n√∫mero de preguntas del config, pero limitado por MAX_QUESTIONS
    config_max_questions = config_data['params']['num_questions']
    MAX_QUESTIONS = min(MAX_QUESTIONS or 999, config_max_questions)

    # NEW: Use reranking method from config (with backward compatibility)
    RERANKING_METHOD = config_data['params'].get('reranking_method', 'crossencoder')
    USE_LLM_RERANKING = config_data['params']['use_llm_reranker']

    # Backward compatibility check
    if RERANKING_METHOD == 'crossencoder' and not USE_LLM_RERANKING:
        RERANKING_METHOD = 'none'

    print(f"\nüìù Par√°metros actualizados desde config:")
    print(f"‚ùì Max questions: {MAX_QUESTIONS} (config: {config_max_questions}, l√≠mite: {MAX_QUESTIONS or 'sin l√≠mite'})")
    print(f"ü§ñ LLM Reranking (legacy): {USE_LLM_RERANKING}")
    print(f"üîÑ Reranking Method: {RERANKING_METHOD}")
    print(f"üéØ Top-k: {config_data['params'].get('top_k', 'N/A')}")
    print(f"üìä Generate RAG metrics: {config_data['params'].get('generate_rag_metrics', 'N/A')}")
else:
    print(f"\n‚ö†Ô∏è Using default parameters (config not loaded properly)")
    RERANKING_METHOD = 'crossencoder'  # Default value
    USE_LLM_RERANKING = True

print(f"\n‚úÖ Pipeline inicializado correctamente con config m√°s reciente!")
print(f"üîÑ Using reranking method: {RERANKING_METHOD}")

üîç Buscando archivo config m√°s reciente...
‚úÖ Archivo config m√°s reciente encontrado:
   üìÇ evaluation_config_1753514824.json
   ‚è∞ Timestamp: 1753514824 (2025-07-26 07:27:04)
   üìã Otros archivos config encontrados:
      üìÑ evaluation_config_1753508929.json (2025-07-26 05:48:49)
      üìÑ evaluation_config_1753506317.json (2025-07-26 05:05:17)
      üìÑ evaluation_config_1753492446.json (2025-07-26 01:14:06)

üìÇ Configuraci√≥n final de rutas:
üìÅ Datos base: /content/drive/MyDrive/TesisMagister/acumulative/colab_data/
üíæ Salida resultados: /content/drive/MyDrive/TesisMagister/acumulative/
‚öôÔ∏è Archivo configuraci√≥n: /content/drive/MyDrive/TesisMagister/acumulative/evaluation_config_1753514824.json
‚úÖ Archivo config verificado: existe
   üìä Tama√±o: 34.8 KB
   üìÖ Modificado: 2025-07-26 07:27:06

üîß Inicializando pipeline de datos...
üìã Cargando config desde: evaluation_config_1753514824.json
‚úÖ Config cargado exitosamente:
   üìù 10 preguntas cargadas


## üß™ 4. Pipeline de Evaluaci√≥n Principal

In [49]:
# =============================================================================
# REAL EVALUATION CLASSES - NO SIMULATION, ACTUAL DATA ONLY (ENHANCED CONTENT LIMITS)
# =============================================================================

class RealEmbeddingRetriever:
    """Real embedding retriever using actual parquet files and cosine similarity"""

    def __init__(self, parquet_file: str):
        print(f"üîÑ Loading {parquet_file}...")
        self.df = pd.read_parquet(parquet_file)
        embeddings_list = self.df['embedding'].tolist()
        self.embeddings_matrix = np.array(embeddings_list)
        self.num_docs = len(self.df)
        self.embedding_dim = self.embeddings_matrix.shape[1]
        print(f"‚úÖ {self.num_docs:,} docs, {self.embedding_dim} dims")
        self.documents = self.df[['document', 'link', 'title', 'summary', 'content']].to_dict('records')

    def search_documents(self, query_embedding: np.ndarray, top_k: int = 10) -> List[Dict]:
        """Perform actual cosine similarity search"""
        query_embedding = query_embedding.reshape(1, -1)
        similarities = cosine_similarity(query_embedding, self.embeddings_matrix)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]

        results = []
        for idx in top_indices:
            doc = self.documents[idx].copy()
            doc['cosine_similarity'] = float(similarities[idx])
            doc['rank'] = len(results) + 1
            results.append(doc)
        return results

# =============================================================================
# EMBEDDED CROSSENCODER RERANKER FOR COLAB
# =============================================================================

def colab_crossencoder_rerank(question: str, docs: List[Dict], top_k: int = 10, embedding_model: str = None) -> List[Dict]:
    """
    Embedded CrossEncoder reranking function for Colab (same as individual search).

    Uses sigmoid normalization instead of softmax to ensure scores are comparable
    across different embedding models regardless of the number of documents returned.

    Args:
        question: The query string
        docs: List of documents to rerank
        top_k: Number of top documents to return
        embedding_model: Name of the embedding model used (for logging/debugging)
    """
    if not docs:
        return []

    try:
        from sentence_transformers import CrossEncoder

        # The CrossEncoder model expects pairs of [query, passage]
        model_inputs = [[question, doc.get("content", "") or doc.get("document", "")] for doc in docs]

        # Initialize the same CrossEncoder as individual search
        cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

        # Predict the raw logit scores
        raw_scores = cross_encoder.predict(model_inputs)

        # Apply sigmoid normalization (same as individual search)
        # NOTE: CrossEncoder scores ARE comparable across embedding models because
        # they use the same CrossEncoder model to evaluate all query-document pairs
        # regardless of which embedding model retrieved them initially
        try:
            raw_scores = np.array(raw_scores)
            # Apply sigmoid: 1 / (1 + e^(-x))
            # This maps CrossEncoder logits to [0,1] probabilities
            final_scores = 1 / (1 + np.exp(-raw_scores))
        except (OverflowError, ZeroDivisionError):
            # Fallback: Min-max normalization if sigmoid fails
            raw_scores = np.array(raw_scores)
            min_score = np.min(raw_scores)
            max_score = np.max(raw_scores)
            if max_score > min_score:
                final_scores = (raw_scores - min_score) / (max_score - min_score)
            else:
                final_scores = np.ones_like(raw_scores) * 0.5  # All equal scores
            print(f"[WARNING] Sigmoid normalization failed for {embedding_model}, using min-max normalization")

        # Add final scores to the documents
        for doc, score in zip(docs, final_scores):
            doc["score"] = float(score)

        # Sort documents by the new score in descending order
        sorted_docs = sorted(docs, key=lambda d: d.get("score", 0.0), reverse=True)

        return sorted_docs[:top_k]

    except ImportError:
        print(f"‚ùå CrossEncoder not available, falling back to original order")
        return docs[:top_k]
    except Exception as e:
        print(f"‚ùå CrossEncoder reranking failed: {e}, falling back to original order")
        return docs[:top_k]

def calculate_ndcg_at_k(relevance_scores: List[float], k: int) -> float:
    """Calculate NDCG@k using actual relevance scores"""
    if k <= 0 or not relevance_scores:
        return 0.0
    dcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(relevance_scores[:k]) if rel > 0)
    ideal_relevance = sorted(relevance_scores[:k], reverse=True)
    idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal_relevance) if rel > 0)
    return dcg / idcg if idcg > 0 else 0.0

def calculate_map_at_k(relevance_scores: List[float], k: int) -> float:
    """Calculate MAP@k using actual relevance scores"""
    if k <= 0 or not relevance_scores:
        return 0.0
    relevant_count = 0
    precision_sum = 0.0
    for i, rel in enumerate(relevance_scores[:k]):
        if rel > 0:
            relevant_count += 1
            precision_at_i = relevant_count / (i + 1)
            precision_sum += precision_at_i
    return precision_sum / relevant_count if relevant_count > 0 else 0.0

def calculate_mrr_at_k(relevance_scores: List[float], k: int) -> float:
    """Calculate MRR@k using actual relevance scores"""
    if k <= 0 or not relevance_scores:
        return 0.0

    top_k_scores = relevance_scores[:k]
    for rank, relevance in enumerate(top_k_scores, 1):
        if relevance > 0:
            return 1.0 / rank
    return 0.0

def calculate_real_retrieval_metrics(retrieved_docs: List[Dict], ground_truth_links: List[str], top_k_values: List[int] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) -> Dict:
    """Calculate retrieval metrics using actual retrieved documents and ground truth"""
    def normalize_link(link: str) -> str:
        if not link:
            return ""
        return link.split('#')[0].split('?')[0].rstrip('/')

    gt_normalized = set(normalize_link(link) for link in ground_truth_links)
    relevance_scores = []
    retrieved_links_normalized = []

    for doc in retrieved_docs:
        link = normalize_link(doc.get('link', ''))
        retrieved_links_normalized.append(link)
        relevance_scores.append(1.0 if link in gt_normalized else 0.0)

    metrics = {}
    for k in top_k_values:
        top_k_relevance = relevance_scores[:k]
        top_k_links = retrieved_links_normalized[:k]

        retrieved_links = set(link for link in top_k_links if link)
        relevant_retrieved = retrieved_links.intersection(gt_normalized)

        precision_k = len(relevant_retrieved) / k if k > 0 else 0.0
        recall_k = len(relevant_retrieved) / len(gt_normalized) if gt_normalized else 0.0
        f1_k = (2 * precision_k * recall_k) / (precision_k + recall_k) if (precision_k + recall_k) > 0 else 0.0

        metrics[f'precision@{k}'] = precision_k
        metrics[f'recall@{k}'] = recall_k
        metrics[f'f1@{k}'] = f1_k
        metrics[f'ndcg@{k}'] = calculate_ndcg_at_k(top_k_relevance, k)
        metrics[f'map@{k}'] = calculate_map_at_k(top_k_relevance, k)
        metrics[f'mrr@{k}'] = calculate_mrr_at_k(relevance_scores, k)

    # Overall MRR
    overall_mrr = calculate_mrr_at_k(relevance_scores, len(relevance_scores))
    metrics['mrr'] = overall_mrr

    return metrics

def generate_real_query_embedding(question: str, model_name: str, query_model_name: str):
    """Generate actual embedding for a question using the appropriate model"""
    if query_model_name.startswith('text-embedding-'):
        # OpenAI model
        try:
            import openai
            api_key = os.environ.get('OPENAI_API_KEY')
            if not api_key:
                raise ValueError("OpenAI API key not available")

            client = openai.OpenAI(api_key=api_key)
            response = client.embeddings.create(
                model=query_model_name,
                input=question
            )
            embedding = np.array(response.data[0].embedding)
            return embedding
        except Exception as e:
            raise ValueError(f"Error generating OpenAI embedding: {e}")
    else:
        # SentenceTransformers model
        try:
            print(f"üîÑ Loading {query_model_name}...")
            try:
                query_model = SentenceTransformer(query_model_name, device='cuda')
            except RuntimeError as e:
                if "cuda" in str(e).lower():
                    print(f"‚ö†Ô∏è CUDA error, using CPU...")
                    query_model = SentenceTransformer(query_model_name, device='cpu')
                else:
                    raise

            embedding = query_model.encode(question)
            return embedding
        except Exception as e:
            raise ValueError(f"Error generating SentenceTransformer embedding: {e}")

class RealBERTScoreEvaluator:
    """Real BERTScore evaluator using enhanced content limits"""

    def __init__(self):
        self.available = False
        try:
            from bert_score import score as bert_score
            self.bert_score = bert_score
            self.available = True
            print("‚úÖ BERTScore evaluator initialized with unlimited content")
        except ImportError as e:
            print(f"‚ö†Ô∏è BERTScore not available - install with: pip install bert-score (Error: {e})")
            self.available = False
        except Exception as e:
            print(f"‚ö†Ô∏è BERTScore initialization failed: {e}")
            self.available = False

    def calculate_bert_score(self, generated_answer: str, reference_answer: str, lang: str = "en") -> Dict:
        """Calculate REAL BERTScore with UNLIMITED content length"""
        if not self.available:
            return {
                'bert_score_available': False,
                'reason': 'BERTScore package not installed or initialization failed'
            }

        if not generated_answer or not reference_answer:
            return {
                'bert_score_available': False,
                'reason': 'Empty generated_answer or reference_answer'
            }

        try:
            print(f"üîÑ Calculating BERTScore with unlimited content...")

            # NO CONTENT TRUNCATION for BERTScore - use full text
            # Calculate BERTScore (P, R, F1) - using standard names
            P, R, F1 = self.bert_score([generated_answer], [reference_answer], lang=lang, verbose=False)

            bert_results = {
                'bert_score_available': True,
                'bert_precision': float(P[0]),  # Standard BERTScore name
                'bert_recall': float(R[0]),     # Standard BERTScore name
                'bert_f1': float(F1[0]),        # Standard BERTScore name
                'language': lang,
                'content_length_used': {
                    'generated_answer': len(generated_answer),
                    'reference_answer': len(reference_answer)
                }
            }

            print(f"‚úÖ BERTScore calculated with full content - P:{bert_results['bert_precision']:.3f}, R:{bert_results['bert_recall']:.3f}, F1:{bert_results['bert_f1']:.3f}")
            print(f"   üìè Content lengths - Generated: {bert_results['content_length_used']['generated_answer']}, Reference: {bert_results['content_length_used']['reference_answer']}")
            return bert_results

        except Exception as e:
            print(f"‚ùå BERTScore calculation error: {e}")
            return {
                'bert_score_available': False,
                'error': str(e)
            }

class RealRAGCalculator:
    """Real RAG calculator with enhanced content limits"""

    def __init__(self):
        self.client = None
        self.has_openai = False
        self.bert_evaluator = RealBERTScoreEvaluator()

        api_key = os.environ.get('OPENAI_API_KEY')
        if api_key:
            try:
                import openai
                openai.api_key = api_key
                self.client = openai
                self.has_openai = True
                print("‚úÖ RAG Calculator initialized with ENHANCED CONTENT LIMITS")
                print(f"   üìù Answer generation: {CONTENT_LIMITS['answer_generation']} chars")
                print(f"   üéØ RAGAS context: {CONTENT_LIMITS['context_for_ragas']} chars")
                print(f"   üìä BERTScore: {CONTENT_LIMITS['bert_score']}")
            except Exception as e:
                print(f"‚ùå RAG init error: {e}")
        else:
            print("‚ö†Ô∏è RAG Calculator: No OpenAI API key - RAG metrics disabled")

    def generate_answer(self, question: str, retrieved_docs: List[Dict]) -> str:
        """Generate actual answer using OpenAI GPT with ENHANCED content limits"""
        if not self.client or not self.has_openai:
            return "No answer available - OpenAI API not configured"

        # ENHANCED: Use 2000 chars per document (was 500)
        answer_gen_limit = CONTENT_LIMITS['answer_generation']
        context = "\n\n".join([
            f"Document {i+1}: {doc.get('document', '')[:answer_gen_limit]}..."
            for i, doc in enumerate(retrieved_docs[:3])
        ])

        prompt = f"""Based only on the provided context, answer the following question.
        If the context doesn't contain enough information, say so.

        Context:
        {context}

        Question: {question}

        Answer:"""

        try:
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
                temperature=0.1
            )
            generated_answer = response.choices[0].message.content.strip()
            print(f"üìù Answer generated with {answer_gen_limit} chars/doc context (enhanced from 500)")
            return generated_answer
        except Exception as e:
            print(f"‚ùå OpenAI API error: {e}")
            return f"Error generating answer: {str(e)}"

    def calculate_real_rag_metrics(self, question: str, retrieved_docs: List[Dict], ground_truth: str = None) -> Dict:
        """Calculate RAGAS metrics with ENHANCED content limits"""
        if not self.client or not self.has_openai:
            return {
                'rag_available': False,
                'reason': 'OpenAI API not available'
            }

        try:
            # Import ALL available RAGAS metrics
            from ragas import evaluate
            from ragas.metrics import (
                faithfulness,
                answer_relevancy,
                context_precision,
                context_recall,
                answer_correctness,
                answer_similarity
            )
            from datasets import Dataset

            # Generate actual answer with enhanced limits
            generated_answer = self.generate_answer(question, retrieved_docs)

            if not generated_answer or len(generated_answer.strip()) < 10:
                return {
                    'rag_available': False,
                    'reason': 'Generated answer too short or empty'
                }

            # ENHANCED: Prepare contexts with 3000 chars per document (was 1000)
            ragas_context_limit = CONTENT_LIMITS['context_for_ragas']
            contexts = []
            for doc in retrieved_docs[:3]:
                doc_content = doc.get('document', '')
                if isinstance(doc_content, str) and len(doc_content) > 0:
                    contexts.append(doc_content[:ragas_context_limit])

            if not contexts:
                return {
                    'rag_available': False,
                    'reason': 'No valid document contexts found'
                }

            print(f"üéØ RAGAS contexts prepared with {ragas_context_limit} chars/doc (enhanced from 1000)")

            # Create ground truth if not provided
            if ground_truth is None:
                ground_truth = f"Reference answer based on retrieved Microsoft documentation for the question: {question}"

            # Prepare data for COMPLETE RAGAS evaluation
            data = {
                "question": [str(question).strip()],
                "answer": [str(generated_answer).strip()],
                "contexts": [contexts],
                "ground_truth": [str(ground_truth).strip()]
            }

            # Create dataset
            dataset = Dataset.from_dict(data)

            # Use ALL available RAGAS metrics
            all_metrics = [
                faithfulness,
                answer_relevancy,
                context_precision,
                context_recall,
                answer_correctness,
                answer_similarity
            ]

            print(f"üîÑ Evaluating with ENHANCED RAGAS ({len(all_metrics)} metrics)...")

            # Evaluate with ALL metrics
            result = evaluate(dataset, metrics=all_metrics)

            # Extract scores using STANDARD RAGAS names (no mapping)
            scores = {}
            standard_ragas_names = [
                'faithfulness', 'answer_relevancy', 'context_precision',
                'context_recall', 'answer_correctness', 'answer_similarity', 'semantic_similarity'
            ]

            if hasattr(result, 'to_pandas'):
                df_result = result.to_pandas()
                print(f"üìä RAGAS returned columns: {list(df_result.columns)}")

                for col in df_result.columns:
                    # Skip non-metric columns
                    if col.lower() in ['question', 'answer', 'contexts', 'ground_truth']:
                        print(f"üìã Data column (skipping): {col}")
                        continue

                    # Process metric columns - use STANDARD names as returned by RAGAS
                    col_lower = col.lower()
                    if col_lower in standard_ragas_names:
                        try:
                            value = df_result[col].iloc[0]
                            if isinstance(value, (int, float)) and not pd.isna(value):
                                # Store with STANDARD RAGAS name (no mapping)
                                scores[col_lower] = max(0.0, min(1.0, float(value)))
                                print(f"‚úÖ Extracted {col} (enhanced): {scores[col_lower]:.3f}")
                            else:
                                print(f"‚ö†Ô∏è Invalid value for {col}: {value} (type: {type(value)})")
                        except Exception as e:
                            print(f"‚ö†Ô∏è Error extracting {col}: {e}")
                    else:
                        print(f"üìã Unknown column (skipping): {col}")

            # Create result using STANDARD metric names
            mapped_scores = {
                'rag_available': True,
                'evaluation_method': 'RAGAS_ENHANCED_CONTENT_LIMITS',
                'generated_answer': generated_answer[:200] + '...' if len(generated_answer) > 200 else generated_answer,
                'ground_truth_used': ground_truth[:100] + '...' if len(ground_truth) > 100 else ground_truth,
                'metrics_attempted': len(all_metrics),
                'metrics_successful': len(scores),
                'content_enhancements': {
                    'answer_generation_chars': CONTENT_LIMITS['answer_generation'],
                    'ragas_context_chars': CONTENT_LIMITS['context_for_ragas'],
                    'bert_score_unlimited': CONTENT_LIMITS['bert_score'] == 'sin_limite'
                }
            }

            # Add STANDARD RAGAS metric names (no mapping)
            for metric_name in standard_ragas_names:
                if metric_name in scores:
                    mapped_scores[metric_name] = scores[metric_name]
                else:
                    print(f"‚ö†Ô∏è Standard metric {metric_name} not available in results")

            # Add BERTScore with UNLIMITED content
            if self.bert_evaluator.available:
                print(f"üîÑ Calculating BERTScore with unlimited content...")
                # NO TRUNCATION for BERTScore - use full generated_answer and ground_truth
                bert_results = self.bert_evaluator.calculate_bert_score(generated_answer, ground_truth)
                mapped_scores.update(bert_results)

                if bert_results.get('bert_score_available'):
                    print(f"‚úÖ BERTScore added with unlimited content:")
                    print(f"   bert_precision: {bert_results.get('bert_precision', 'N/A'):.3f}")
                    print(f"   bert_recall: {bert_results.get('bert_recall', 'N/A'):.3f}")
                    print(f"   bert_f1: {bert_results.get('bert_f1', 'N/A'):.3f}")
                else:
                    print(f"‚ö†Ô∏è BERTScore not available: {bert_results.get('reason', 'Unknown error')}")
            else:
                mapped_scores.update({
                    'bert_score_available': False,
                    'reason': 'BERTScore package not installed or initialization failed'
                })
                print(f"‚ö†Ô∏è BERTScore evaluator not available")

            print(f"‚úÖ ENHANCED evaluation completed: {len(scores)}/{len(all_metrics)} RAGAS metrics + BERTScore")
            return mapped_scores

        except Exception as e:
            print(f"‚ùå RAG evaluation error: {e}")
            print(f"üí° Error type: {type(e).__name__}")

            return {
                'rag_available': False,
                'error': str(e)[:200],
                'error_type': type(e).__name__,
                'attempted_complete_evaluation': True
            }

class RealLLMReranker:
    """Real LLM reranker with enhanced content limits"""

    def __init__(self):
        self.client = None
        # ENHANCED: Use 4000 chars limit (was 3000)
        self.max_content_length = CONTENT_LIMITS['llm_reranking']
        api_key = os.environ.get('OPENAI_API_KEY')
        if api_key:
            try:
                import openai
                openai.api_key = api_key
                self.client = openai
                print(f"‚úÖ LLM Reranker initialized with ENHANCED {self.max_content_length} char limit (was 3000)")
            except Exception as e:
                print(f"‚ùå Reranker init error: {e}")

    def rerank_documents(self, question: str, retrieved_docs: List[Dict], top_k: int = 10) -> List[Dict]:
        """Perform actual LLM reranking with ENHANCED content processing"""
        if not self.client or not retrieved_docs:
            return retrieved_docs

        docs_to_rerank = retrieved_docs[:min(top_k, len(retrieved_docs))]
        if len(docs_to_rerank) <= 1:
            return docs_to_rerank

        try:
            prompt = f"Question: {question}\n\nRank documents by relevance (numbers only):\n"
            for i, doc in enumerate(docs_to_rerank, 1):
                content = doc.get('document', '') or doc.get('content', '')
                # ENHANCED: Use intelligent truncation with 4000 chars (was 3000)
                if len(content) > self.max_content_length:
                    # Use intelligent truncation: keep beginning and end
                    half_length = self.max_content_length // 2
                    content = content[:half_length] + "\n\n[...CONTENIDO MEDIO OMITIDO...]\n\n" + content[-half_length:]

                prompt += f"{i}. {content}...\n"
            prompt += "\nRanking:"

            print(f"ü§ñ LLM reranking with enhanced {self.max_content_length} chars/doc")

            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=50,
                temperature=0.1
            )

            ranking_text = response.choices[0].message.content.strip()

            import re
            numbers = [int(x) - 1 for x in re.findall(r'\d+', ranking_text) if 0 <= int(x) - 1 < len(docs_to_rerank)]

            if not numbers:
                print("‚ö†Ô∏è No valid ranking found, returning original order")
                return retrieved_docs

            # Reorder based on ranking
            reranked = [docs_to_rerank[i] for i in numbers if i < len(docs_to_rerank)]
            remaining = [docs_to_rerank[i] for i in range(len(docs_to_rerank)) if i not in numbers]
            final_docs = reranked + remaining + retrieved_docs[len(docs_to_rerank):]

            for i, doc in enumerate(final_docs):
                doc['rank'] = i + 1
                doc['reranked'] = i < len(reranked)

            return final_docs

        except Exception as e:
            print(f"‚ùå Reranking error: {e}")
            return retrieved_docs

print("‚úÖ Enhanced evaluation classes loaded - IMPROVED CONTENT LIMITS")
print(f"üìè Answer Generation: {CONTENT_LIMITS['answer_generation']} chars")
print(f"üéØ RAGAS Context: {CONTENT_LIMITS['context_for_ragas']} chars")
print(f"ü§ñ LLM Reranking: {CONTENT_LIMITS['llm_reranking']} chars")
print(f"üìä BERTScore: {CONTENT_LIMITS['bert_score']}")
print("üß† CrossEncoder reranking function embedded for Colab compatibility")

‚úÖ Enhanced evaluation classes loaded - IMPROVED CONTENT LIMITS
üìè Answer Generation: 2000 chars
üéØ RAGAS Context: 3000 chars
ü§ñ LLM Reranking: 4000 chars
üìä BERTScore: sin_limite
üß† CrossEncoder reranking function embedded for Colab compatibility


In [50]:
# =============================================================================
# DOCUMENT AGGREGATOR - CONVERT CHUNKS TO FULL DOCUMENTS
# =============================================================================

class DocumentAggregator:
    """
    Modular class to convert chunk-based retrieval to document-based retrieval.

    Configuration:
    - CHUNK_MULTIPLIER: How many chunks to retrieve to get target number of documents
    - TARGET_DOCUMENTS: Final number of unique documents to return
    """

    def __init__(self, chunk_multiplier: float = 3.0, target_documents: int = 10, debug: bool = False):
        """
        Initialize DocumentAggregator

        Args:
            chunk_multiplier: Multiplier for initial chunk retrieval (e.g., 3.0 means retrieve 30 chunks for 10 docs)
            target_documents: Final number of unique documents to return
            debug: Enable debug logging
        """
        self.chunk_multiplier = chunk_multiplier
        self.target_documents = target_documents
        self.debug = debug

        if self.debug:
            print(f"üìä DocumentAggregator initialized:")
            print(f"   üî¢ Chunk multiplier: {self.chunk_multiplier}")
            print(f"   üéØ Target documents: {self.target_documents}")

    def normalize_link(self, link: str) -> str:
        """Normalize link for deduplication"""
        if not link:
            return ""
        return link.split('#')[0].split('?')[0].rstrip('/')

    def aggregate_chunks_to_documents(self, retrieved_chunks: List[Dict]) -> List[Dict]:
        """
        Convert list of chunks to list of unique documents with aggregated content.

        Args:
            retrieved_chunks: List of chunk dictionaries from retrieval

        Returns:
            List of document dictionaries with aggregated content
        """
        if not retrieved_chunks:
            return []

        if self.debug:
            print(f"üì• Input: {len(retrieved_chunks)} chunks")

        # Group chunks by normalized link
        document_groups = {}

        for chunk in retrieved_chunks:
            link = self.normalize_link(chunk.get('link', ''))
            if not link:
                continue

            if link not in document_groups:
                document_groups[link] = {
                    'chunks': [],
                    'title': chunk.get('title', ''),
                    'summary': chunk.get('summary', ''),
                    'link': chunk.get('link', ''),
                    'best_similarity': 0.0,
                    'best_rank': float('inf')
                }

            # Add chunk to document group
            document_groups[link]['chunks'].append(chunk)

            # Track best similarity and rank for this document
            similarity = chunk.get('cosine_similarity', 0.0)
            rank = chunk.get('rank', float('inf'))

            if similarity > document_groups[link]['best_similarity']:
                document_groups[link]['best_similarity'] = similarity

            if rank < document_groups[link]['best_rank']:
                document_groups[link]['best_rank'] = rank

        if self.debug:
            print(f"üìä Grouped into {len(document_groups)} unique documents")

        # Create aggregated documents
        aggregated_docs = []

        for link, doc_group in document_groups.items():
            chunks = doc_group['chunks']

            # Sort chunks by similarity (best first)
            chunks.sort(key=lambda x: x.get('cosine_similarity', 0.0), reverse=True)

            # Aggregate content from all chunks
            aggregated_content = []
            chunk_contents = []

            for chunk in chunks:
                chunk_content = chunk.get('content', '') or chunk.get('document', '')
                if chunk_content and chunk_content not in chunk_contents:
                    chunk_contents.append(chunk_content)
                    aggregated_content.append(chunk_content)

            # Create aggregated document
            aggregated_doc = {
                'title': doc_group['title'],
                'summary': doc_group['summary'],
                'link': doc_group['link'],
                'document': ' '.join(aggregated_content),  # Full aggregated content
                'content': ' '.join(aggregated_content),   # Alias for compatibility
                'cosine_similarity': doc_group['best_similarity'],
                'rank': 0,  # Will be set later
                'num_chunks': len(chunks),
                'chunk_similarities': [c.get('cosine_similarity', 0.0) for c in chunks],
                'aggregated': True  # Flag to indicate this is an aggregated document
            }

            aggregated_docs.append(aggregated_doc)

        # Sort by best similarity (highest first)
        aggregated_docs.sort(key=lambda x: x['cosine_similarity'], reverse=True)

        # Limit to target number of documents
        final_docs = aggregated_docs[:self.target_documents]

        # Set final ranks
        for i, doc in enumerate(final_docs):
            doc['rank'] = i + 1

        if self.debug:
            print(f"üì§ Output: {len(final_docs)} unique documents")
            for i, doc in enumerate(final_docs[:3]):  # Show first 3
                print(f"   üìÑ Doc {i+1}: {doc['num_chunks']} chunks, sim={doc['cosine_similarity']:.3f}")

        return final_docs

    def search_documents_aggregated(self, retriever, query_embedding: np.ndarray) -> List[Dict]:
        """
        Perform chunk retrieval and aggregate to documents.

        Args:
            retriever: The chunk-based retriever
            query_embedding: Query embedding vector

        Returns:
            List of aggregated document dictionaries
        """
        # Calculate how many chunks to retrieve
        chunks_to_retrieve = int(self.target_documents * self.chunk_multiplier)

        if self.debug:
            print(f"üîç Retrieving {chunks_to_retrieve} chunks to get {self.target_documents} documents")

        # Retrieve chunks
        retrieved_chunks = retriever.search_documents(query_embedding, top_k=chunks_to_retrieve)

        # Aggregate to documents
        aggregated_docs = self.aggregate_chunks_to_documents(retrieved_chunks)

        return aggregated_docs

# =============================================================================
# ENHANCED RETRIEVER WITH DOCUMENT AGGREGATION
# =============================================================================

class DocumentAwareRetriever:
    """Wrapper around RealEmbeddingRetriever that provides document-level retrieval"""

    def __init__(self, parquet_file: str, chunk_multiplier: float = 3.0, target_documents: int = 10, debug: bool = False):
        """
        Initialize document-aware retriever

        Args:
            parquet_file: Path to parquet file with chunk embeddings
            chunk_multiplier: Multiplier for chunk retrieval
            target_documents: Number of unique documents to return
            debug: Enable debug logging
        """
        self.chunk_retriever = RealEmbeddingRetriever(parquet_file)
        self.aggregator = DocumentAggregator(chunk_multiplier, target_documents, debug)
        self.debug = debug

        # Expose chunk retriever properties
        self.embedding_dim = self.chunk_retriever.embedding_dim
        self.num_docs = self.chunk_retriever.num_docs  # This is actually chunks count

        if self.debug:
            print(f"üîß DocumentAwareRetriever initialized")
            print(f"   üìä Total chunks: {self.num_docs:,}")
            print(f"   üéØ Target docs per query: {target_documents}")

    def search_documents(self, query_embedding: np.ndarray, top_k: int = 10) -> List[Dict]:
        """
        Search for documents (aggregated from chunks)

        Args:
            query_embedding: Query embedding vector
            top_k: Number of documents to return (overrides target_documents if provided)

        Returns:
            List of aggregated document dictionaries
        """
        # Update target if top_k is specified
        if top_k != self.aggregator.target_documents:
            self.aggregator.target_documents = top_k

        return self.aggregator.search_documents_aggregated(self.chunk_retriever, query_embedding)

# =============================================================================
# CONFIGURATION VARIABLES
# =============================================================================

# Global configuration for document aggregation
CHUNK_TO_DOCUMENT_CONFIG = {
    'enabled': True,           # Enable/disable document aggregation
    'chunk_multiplier': 3.0,   # Retrieve 3x chunks to get target documents
    'target_documents': 10,    # Final number of unique documents
    'debug': False            # Enable debug logging
}

print("‚úÖ Document aggregation classes loaded")
print(f"üìä Config: {CHUNK_TO_DOCUMENT_CONFIG}")
print("üéØ Ready to convert chunk-based retrieval to document-based retrieval")

‚úÖ Document aggregation classes loaded
üìä Config: {'enabled': True, 'chunk_multiplier': 3.0, 'target_documents': 10, 'debug': False}
üéØ Ready to convert chunk-based retrieval to document-based retrieval


## üìä 5. Procesamiento y An√°lisis de Resultados

In [51]:
print("üîÑ Running REAL evaluation with actual data - NO SIMULATION...")
print(f"üîÑ Reranking method: {RERANKING_METHOD}")

# Run the REAL evaluation using actual embeddings, retrieval, and RAGAS
evaluation_result = run_real_complete_evaluation(
    available_models=available_models,
    config_data=config_data,
    data_pipeline=data_pipeline,
    reranking_method=RERANKING_METHOD,  # Use the new reranking method parameter
    max_questions=MAX_QUESTIONS,
    debug=DEBUG_MODE
)

all_models_results = evaluation_result['all_model_results']
evaluation_duration = evaluation_result['evaluation_duration']
evaluation_params = evaluation_result['evaluation_params']

print("\nüíæ Saving REAL results in EXACT original format...")

# Save results using embedded function (EXACT format) with REAL DATA
saved_files = embedded_process_and_save_results(
    all_model_results=all_models_results,
    output_path=RESULTS_OUTPUT_PATH,
    evaluation_params=evaluation_params,
    evaluation_duration=evaluation_duration
)

print("\nüíæ Archivos guardados:")
if saved_files:
    print(f"  üìÑ JSON: {saved_files['json']}")
    print(f"  ‚è∞ Timestamp: {saved_files['timestamp']}")
    print(f"  üåç Time: {saved_files['chile_time']}")
    print(f"  ‚úÖ Format verified: {saved_files['format_verified']}")
    print(f"  ‚úÖ REAL data verified: {saved_files['real_data_verified']}")
else:
    print("  ‚ùå Error saving files")

print("\nüî¨ VERIFICACI√ìN CIENT√çFICA:")
print("‚úÖ Todos los valores de m√©tricas son REALES")
print("‚úÖ NO se usaron valores aleatorios o simulados")
print("‚úÖ Retrieval basado en similitud coseno real")
print("‚úÖ RAG evaluation con RAGAS framework real")
print(f"‚úÖ Reranking method used: {RERANKING_METHOD}")
if RERANKING_METHOD == "crossencoder":
    print("üß† CrossEncoder reranking with ms-marco-MiniLM-L-6-v2 (same as individual search)")
elif RERANKING_METHOD == "standard":
    print("üìä Standard LLM reranking with OpenAI GPT-3.5-turbo")
else:
    print("‚ùå No reranking applied")

print("\n‚úÖ Procesamiento de resultados completado con DATOS REALES!")
print("üéØ Compatible con Streamlit app - M√âTRICAS CIENT√çFICAMENTE V√ÅLIDAS!")

üîÑ Running REAL evaluation with actual data - NO SIMULATION...
üîÑ Reranking method: crossencoder
üöÄ Starting REAL evaluation for 4 models...
üîÑ Reranking method: crossencoder
‚úÖ BERTScore evaluator initialized with unlimited content
‚úÖ RAG Calculator initialized with ENHANCED CONTENT LIMITS
   üìù Answer generation: 2000 chars
   üéØ RAGAS context: 3000 chars
   üìä BERTScore: sin_limite
üß† Using embedded CrossEncoder reranking (ms-marco-MiniLM-L-6-v2)

üéØ Evaluating model: ada
üìä Using document aggregation (chunks‚Üídocs)
üîÑ Loading /content/drive/MyDrive/TesisMagister/acumulative/colab_data/docs_ada_with_embeddings_20250721_123712.parquet...
‚úÖ 187,031 docs, 1536 dims
‚úÖ Dimension match: 1536 == 1536

üöÄ Starting REAL evaluation for 10 questions...


Real eval ada:   0%|          | 0/10 [00:00<?, ?it/s]

config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

üß† Applied embedded CrossEncoder reranking for question 0
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.000
‚úÖ Extracted answer_relevancy (enhanced): 0.878
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.700
‚úÖ Extracted answer_correctness (enhanced): 0.595
‚úÖ Extracted semantic_similarity (enhanced): 0.880
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval ada:  10%|‚ñà         | 1/10 [00:44<06:39, 44.42s/it]

‚úÖ BERTScore calculated with full content - P:0.894, R:0.816, F1:0.853
   üìè Content lengths - Generated: 267, Reference: 1449
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.894
   bert_recall: 0.816
   bert_f1: 0.853
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üß† Applied embedded CrossEncoder reranking for question 1
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.636
‚úÖ Extracted answer_correctness (enhanced): 0.222
‚úÖ Extracted semantic_similarity (enhanced): 0.887
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval ada:  20%|‚ñà‚ñà        | 2/10 [01:16<04:57, 37.13s/it]

‚úÖ BERTScore calculated with full content - P:0.889, R:0.802, F1:0.843
   üìè Content lengths - Generated: 214, Reference: 1732
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.889
   bert_recall: 0.802
   bert_f1: 0.843
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üß† Applied embedded CrossEncoder reranking for question 2
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.250
‚úÖ Extracted answer_relevancy (enhanced): 0.845
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.958
‚úÖ Extracted semantic_similarity (enhanced): 0.833
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval ada:  30%|‚ñà‚ñà‚ñà       | 3/10 [01:41<03:39, 31.41s/it]

‚úÖ BERTScore calculated with full content - P:0.836, R:0.795, F1:0.815
   üìè Content lengths - Generated: 405, Reference: 255
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.836
   bert_recall: 0.795
   bert_f1: 0.815
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üß† Applied embedded CrossEncoder reranking for question 3
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.875
‚úÖ Extracted answer_relevancy (enhanced): 0.895
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.778
‚úÖ Extracted answer_correctness (enhanced): 0.607
‚úÖ Extracted semantic_similarity (enhanced): 0.928
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval ada:  40%|‚ñà‚ñà‚ñà‚ñà      | 4/10 [02:18<03:23, 33.86s/it]

‚úÖ BERTScore calculated with full content - P:0.884, R:0.828, F1:0.855
   üìè Content lengths - Generated: 627, Reference: 1108
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.884
   bert_recall: 0.828
   bert_f1: 0.855
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üß† Applied embedded CrossEncoder reranking for question 4
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.429
‚úÖ Extracted answer_correctness (enhanced): 0.200
‚úÖ Extracted semantic_similarity (enhanced): 0.799
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval ada:  50%|‚ñà‚ñà‚ñà‚ñà‚ñà     | 5/10 [02:49<02:43, 32.69s/it]

‚úÖ BERTScore calculated with full content - P:0.854, R:0.788, F1:0.820
   üìè Content lengths - Generated: 218, Reference: 705
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.854
   bert_recall: 0.788
   bert_f1: 0.820
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üß† Applied embedded CrossEncoder reranking for question 5
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.000
‚úÖ Extracted answer_relevancy (enhanced): 0.843
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.447
‚úÖ Extracted semantic_similarity (enhanced): 0.790
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval ada:  60%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà    | 6/10 [03:20<02:08, 32.19s/it]

‚úÖ BERTScore calculated with full content - P:0.809, R:0.750, F1:0.778
   üìè Content lengths - Generated: 632, Reference: 866
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.809
   bert_recall: 0.750
   bert_f1: 0.778
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üß† Applied embedded CrossEncoder reranking for question 6
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.930
‚úÖ Extracted context_precision (enhanced): 0.333
‚úÖ Extracted context_recall (enhanced): 0.500
‚úÖ Extracted answer_correctness (enhanced): 0.368
‚úÖ Extracted semantic_similarity (enhanced): 0.873
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval ada:  70%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà   | 7/10 [03:49<01:33, 31.01s/it]

‚úÖ BERTScore calculated with full content - P:0.871, R:0.785, F1:0.826
   üìè Content lengths - Generated: 192, Reference: 1149
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.871
   bert_recall: 0.785
   bert_f1: 0.826
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üß† Applied embedded CrossEncoder reranking for question 7
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.400
‚úÖ Extracted answer_correctness (enhanced): 0.180
‚úÖ Extracted semantic_similarity (enhanced): 0.720
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval ada:  80%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà  | 8/10 [04:12<00:57, 28.60s/it]

‚úÖ BERTScore calculated with full content - P:0.829, R:0.795, F1:0.811
   üìè Content lengths - Generated: 147, Reference: 412
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.829
   bert_recall: 0.795
   bert_f1: 0.811
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üß† Applied embedded CrossEncoder reranking for question 8
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.429
‚úÖ Extracted answer_relevancy (enhanced): 0.923
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.933
‚úÖ Extracted answer_correctness (enhanced): 0.521
‚úÖ Extracted semantic_similarity (enhanced): 0.921
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval ada:  90%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà | 9/10 [04:55<00:33, 33.20s/it]

‚úÖ BERTScore calculated with full content - P:0.861, R:0.808, F1:0.834
   üìè Content lengths - Generated: 816, Reference: 2328
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.861
   bert_recall: 0.808
   bert_f1: 0.834
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üß† Applied embedded CrossEncoder reranking for question 9
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.500
‚úÖ Extracted answer_relevancy (enhanced): 0.927
‚úÖ Extracted context_precision (enhanced): 0.583
‚úÖ Extracted context_recall (enhanced): 0.400
‚úÖ Extracted answer_correctness (enhanced): 0.586
‚úÖ Extracted semantic_similarity (enhanced): 0.844
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval ada: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 10/10 [05:24<00:00, 32.43s/it]

‚úÖ BERTScore calculated with full content - P:0.869, R:0.834, F1:0.851
   üìè Content lengths - Generated: 442, Reference: 623
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.869
   bert_recall: 0.834
   bert_f1: 0.851
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üìä Found 9 RAG metric types: ['answer_correctness', 'answer_relevancy', 'bert_f1', 'bert_precision', 'bert_recall', 'context_precision', 'context_recall', 'faithfulness', 'semantic_similarity']
‚úÖ Calculated avg_answer_correctness: 0.468 (from 10 values)
‚úÖ Calculated avg_answer_relevancy: 0.624 (from 10 values)
‚úÖ Calculated avg_bert_f1: 0.829 (from 10 values)
‚úÖ Calculated avg_bert_precision: 0.860 (from 10 values)
‚úÖ Calculated avg_bert_recall: 0.800 (from 10 values)
‚úÖ Calculated avg_context_precision: 0.692 (from 10 values)
‚úÖ Calculated avg_context_recall: 0.478 (from 10 values)
‚úÖ Calculated avg_faithfulness: 0.405 (from 10 values)
‚úÖ Calculated avg_semantic_similarity: 





üéØ Evaluating model: e5-large
üìä Using document aggregation (chunks‚Üídocs)
üîÑ Loading /content/drive/MyDrive/TesisMagister/acumulative/colab_data/docs_e5large_with_embeddings_20250721_124918.parquet...
‚úÖ 187,031 docs, 1024 dims
üîÑ Loading intfloat/e5-large-v2...
‚úÖ Dimension match: 1024 == 1024

üöÄ Starting REAL evaluation for 10 questions...


Real eval e5-large:   0%|          | 0/10 [00:00<?, ?it/s]

üîÑ Loading intfloat/e5-large-v2...
üß† Applied embedded CrossEncoder reranking for question 0
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.000
‚úÖ Extracted answer_relevancy (enhanced): 0.878
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.750
‚úÖ Extracted answer_correctness (enhanced): 0.541
‚úÖ Extracted semantic_similarity (enhanced): 0.877
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval e5-large:  10%|‚ñà         | 1/10 [00:37<05:36, 37.40s/it]

‚úÖ BERTScore calculated with full content - P:0.894, R:0.814, F1:0.853
   üìè Content lengths - Generated: 268, Reference: 1449
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.894
   bert_recall: 0.814
   bert_f1: 0.853
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading intfloat/e5-large-v2...
üß† Applied embedded CrossEncoder reranking for question 1
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.600
‚úÖ Extracted answer_correctness (enhanced): 0.216
‚úÖ Extracted semantic_similarity (enhanced): 0.863
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval e5-large:  20%|‚ñà‚ñà        | 2/10 [01:08<04:30, 33.84s/it]

‚úÖ BERTScore calculated with full content - P:0.881, R:0.797, F1:0.837
   üìè Content lengths - Generated: 115, Reference: 1732
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.881
   bert_recall: 0.797
   bert_f1: 0.837
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading intfloat/e5-large-v2...
üß† Applied embedded CrossEncoder reranking for question 2
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.333
‚úÖ Extracted answer_relevancy (enhanced): 0.822
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.619
‚úÖ Extracted semantic_similarity (enhanced): 0.840
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval e5-large:  30%|‚ñà‚ñà‚ñà       | 3/10 [01:37<03:41, 31.63s/it]

‚úÖ BERTScore calculated with full content - P:0.845, R:0.793, F1:0.819
   üìè Content lengths - Generated: 319, Reference: 255
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.845
   bert_recall: 0.793
   bert_f1: 0.819
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading intfloat/e5-large-v2...
üß† Applied embedded CrossEncoder reranking for question 3
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.833
‚úÖ Extracted answer_relevancy (enhanced): 0.909
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.444
‚úÖ Extracted answer_correctness (enhanced): 0.420
‚úÖ Extracted semantic_similarity (enhanced): 0.899
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval e5-large:  40%|‚ñà‚ñà‚ñà‚ñà      | 4/10 [02:15<03:23, 33.98s/it]

‚úÖ BERTScore calculated with full content - P:0.869, R:0.817, F1:0.842
   üìè Content lengths - Generated: 444, Reference: 1108
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.869
   bert_recall: 0.817
   bert_f1: 0.842
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading intfloat/e5-large-v2...
üß† Applied embedded CrossEncoder reranking for question 4
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.750
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.286
‚úÖ Extracted answer_correctness (enhanced): 0.204
‚úÖ Extracted semantic_similarity (enhanced): 0.816
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval e5-large:  50%|‚ñà‚ñà‚ñà‚ñà‚ñà     | 5/10 [02:48<02:47, 33.53s/it]

‚úÖ BERTScore calculated with full content - P:0.848, R:0.795, F1:0.820
   üìè Content lengths - Generated: 420, Reference: 705
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.848
   bert_recall: 0.795
   bert_f1: 0.820
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading intfloat/e5-large-v2...
üß† Applied embedded CrossEncoder reranking for question 5
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.833
‚úÖ Extracted answer_relevancy (enhanced): 0.843
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.383
‚úÖ Extracted semantic_similarity (enhanced): 0.783
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval e5-large:  60%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà    | 6/10 [03:28<02:23, 35.96s/it]

‚úÖ BERTScore calculated with full content - P:0.814, R:0.748, F1:0.780
   üìè Content lengths - Generated: 512, Reference: 866
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.814
   bert_recall: 0.748
   bert_f1: 0.780
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading intfloat/e5-large-v2...
üß† Applied embedded CrossEncoder reranking for question 6
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.750
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.386
‚úÖ Extracted semantic_similarity (enhanced): 0.879
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval e5-large:  70%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà   | 7/10 [04:03<01:46, 35.50s/it]

‚úÖ BERTScore calculated with full content - P:0.855, R:0.786, F1:0.819
   üìè Content lengths - Generated: 207, Reference: 1149
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.855
   bert_recall: 0.786
   bert_f1: 0.819
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading intfloat/e5-large-v2...
üß† Applied embedded CrossEncoder reranking for question 7
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.833
‚úÖ Extracted context_recall (enhanced): 0.667
‚úÖ Extracted answer_correctness (enhanced): 0.190
‚úÖ Extracted semantic_similarity (enhanced): 0.759
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval e5-large:  80%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà  | 8/10 [04:36<01:09, 34.63s/it]

‚úÖ BERTScore calculated with full content - P:0.848, R:0.804, F1:0.825
   üìè Content lengths - Generated: 235, Reference: 412
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.848
   bert_recall: 0.804
   bert_f1: 0.825
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading intfloat/e5-large-v2...
üß† Applied embedded CrossEncoder reranking for question 8
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.600
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.333
‚úÖ Extracted answer_correctness (enhanced): 0.462
‚úÖ Extracted semantic_similarity (enhanced): 0.927
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval e5-large:  90%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà | 9/10 [05:18<00:37, 37.10s/it]

‚úÖ BERTScore calculated with full content - P:0.870, R:0.809, F1:0.839
   üìè Content lengths - Generated: 689, Reference: 2328
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.870
   bert_recall: 0.809
   bert_f1: 0.839
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading intfloat/e5-large-v2...
üß† Applied embedded CrossEncoder reranking for question 9
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.929
‚úÖ Extracted answer_relevancy (enhanced): 0.909
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.143
‚úÖ Extracted answer_correctness (enhanced): 0.202
‚úÖ Extracted semantic_similarity (enhanced): 0.809
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval e5-large: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 10/10 [06:04<00:00, 36.45s/it]

‚ùå BERTScore calculation error: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 2.12 MiB is free. Process 5928 has 14.74 GiB memory in use. Of the allocated memory 14.51 GiB is allocated by PyTorch, and 108.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
‚ö†Ô∏è BERTScore not available: Unknown error
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üìä Found 9 RAG metric types: ['answer_correctness', 'answer_relevancy', 'bert_f1', 'bert_precision', 'bert_recall', 'context_precision', 'context_recall', 'faithfulness', 'semantic_similarity']
‚úÖ Calculated avg_answer_correctness: 0.362 (from 10 values)
‚úÖ Calculated avg_answer_relevancy: 0.436 (from 10 values)
‚úÖ Calculated avg_bert





üéØ Evaluating model: mpnet
üìä Using document aggregation (chunks‚Üídocs)
üîÑ Loading /content/drive/MyDrive/TesisMagister/acumulative/colab_data/docs_mpnet_with_embeddings_20250721_125254.parquet...
‚úÖ 187,031 docs, 768 dims
üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
‚úÖ Dimension match: 768 == 768

üöÄ Starting REAL evaluation for 10 questions...


Real eval mpnet:   0%|          | 0/10 [00:00<?, ?it/s]

üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
üß† Applied embedded CrossEncoder reranking for question 0
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.800
‚úÖ Extracted answer_relevancy (enhanced): 0.871
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.700
‚úÖ Extracted answer_correctness (enhanced): 0.491
‚úÖ Extracted semantic_similarity (enhanced): 0.905
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval mpnet:  10%|‚ñà         | 1/10 [00:37<05:40, 37.79s/it]

‚úÖ BERTScore calculated with full content - P:0.872, R:0.826, F1:0.849
   üìè Content lengths - Generated: 503, Reference: 1449
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.872
   bert_recall: 0.826
   bert_f1: 0.849
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
üß† Applied embedded CrossEncoder reranking for question 1
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.500
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.583
‚úÖ Extracted answer_correctness (enhanced): 0.214
‚úÖ Extracted semantic_similarity (enhanced): 0.854
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval mpnet:  20%|‚ñà‚ñà        | 2/10 [01:09<04:33, 34.16s/it]

‚úÖ BERTScore calculated with full content - P:0.879, R:0.796, F1:0.835
   üìè Content lengths - Generated: 104, Reference: 1732
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.879
   bert_recall: 0.796
   bert_f1: 0.835
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
üß† Applied embedded CrossEncoder reranking for question 2
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.000
‚úÖ Extracted answer_relevancy (enhanced): 0.764
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.611
‚úÖ Extracted semantic_similarity (enhanced): 0.807
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval mpnet:  30%|‚ñà‚ñà‚ñà       | 3/10 [01:39<03:47, 32.43s/it]

‚úÖ BERTScore calculated with full content - P:0.826, R:0.796, F1:0.810
   üìè Content lengths - Generated: 432, Reference: 255
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.826
   bert_recall: 0.796
   bert_f1: 0.810
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
üß† Applied embedded CrossEncoder reranking for question 3
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.333
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.222
‚úÖ Extracted semantic_similarity (enhanced): 0.889
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval mpnet:  40%|‚ñà‚ñà‚ñà‚ñà      | 4/10 [02:15<03:23, 33.85s/it]

‚úÖ BERTScore calculated with full content - P:0.875, R:0.807, F1:0.839
   üìè Content lengths - Generated: 302, Reference: 1108
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.875
   bert_recall: 0.807
   bert_f1: 0.839
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
üß† Applied embedded CrossEncoder reranking for question 4
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.429
‚úÖ Extracted answer_correctness (enhanced): 0.173
‚úÖ Extracted semantic_similarity (enhanced): 0.691
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval mpnet:  50%|‚ñà‚ñà‚ñà‚ñà‚ñà     | 5/10 [02:46<02:43, 32.75s/it]

‚úÖ BERTScore calculated with full content - P:0.848, R:0.775, F1:0.810
   üìè Content lengths - Generated: 215, Reference: 705
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.848
   bert_recall: 0.775
   bert_f1: 0.810
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
üß† Applied embedded CrossEncoder reranking for question 5
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.700
‚úÖ Extracted answer_relevancy (enhanced): 0.820
‚úÖ Extracted context_precision (enhanced): 0.833
‚úÖ Extracted context_recall (enhanced): 0.500
‚úÖ Extracted answer_correctness (enhanced): 0.595
‚úÖ Extracted semantic_similarity (enhanced): 0.790
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval mpnet:  60%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà    | 6/10 [03:35<02:33, 38.27s/it]

‚úÖ BERTScore calculated with full content - P:0.817, R:0.763, F1:0.789
   üìè Content lengths - Generated: 909, Reference: 866
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.817
   bert_recall: 0.763
   bert_f1: 0.789
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
üß† Applied embedded CrossEncoder reranking for question 6
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.800
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.250
‚úÖ Extracted answer_correctness (enhanced): 0.390
‚úÖ Extracted semantic_similarity (enhanced): 0.893
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval mpnet:  70%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà   | 7/10 [04:06<01:47, 35.96s/it]

‚úÖ BERTScore calculated with full content - P:0.859, R:0.788, F1:0.822
   üìè Content lengths - Generated: 260, Reference: 1149
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.859
   bert_recall: 0.788
   bert_f1: 0.822
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
üß† Applied embedded CrossEncoder reranking for question 7
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.187
‚úÖ Extracted semantic_similarity (enhanced): 0.747
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval mpnet:  80%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà  | 8/10 [04:37<01:08, 34.29s/it]

‚úÖ BERTScore calculated with full content - P:0.829, R:0.795, F1:0.812
   üìè Content lengths - Generated: 223, Reference: 412
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.829
   bert_recall: 0.795
   bert_f1: 0.812
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
üß† Applied embedded CrossEncoder reranking for question 8
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.833
‚úÖ Extracted context_recall (enhanced): 0.800
‚úÖ Extracted answer_correctness (enhanced): 0.362
‚úÖ Extracted semantic_similarity (enhanced): 0.901
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval mpnet:  90%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà | 9/10 [05:23<00:37, 37.88s/it]

‚úÖ BERTScore calculated with full content - P:0.855, R:0.796, F1:0.824
   üìè Content lengths - Generated: 460, Reference: 2328
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.855
   bert_recall: 0.796
   bert_f1: 0.824
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/multi-qa-mpnet-base-dot-v1...
üß† Applied embedded CrossEncoder reranking for question 9
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.625
‚úÖ Extracted answer_relevancy (enhanced): 0.909
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.333
‚úÖ Extracted answer_correctness (enhanced): 0.214
‚úÖ Extracted semantic_similarity (enhanced): 0.855
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval mpnet: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 10/10 [05:59<00:00, 35.93s/it]

‚úÖ BERTScore calculated with full content - P:0.858, R:0.833, F1:0.845
   üìè Content lengths - Generated: 648, Reference: 623
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.858
   bert_recall: 0.833
   bert_f1: 0.845
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üìä Found 9 RAG metric types: ['answer_correctness', 'answer_relevancy', 'bert_f1', 'bert_precision', 'bert_recall', 'context_precision', 'context_recall', 'faithfulness', 'semantic_similarity']
‚úÖ Calculated avg_answer_correctness: 0.346 (from 10 values)
‚úÖ Calculated avg_answer_relevancy: 0.336 (from 10 values)
‚úÖ Calculated avg_bert_f1: 0.824 (from 10 values)
‚úÖ Calculated avg_bert_precision: 0.852 (from 10 values)
‚úÖ Calculated avg_bert_recall: 0.798 (from 10 values)
‚úÖ Calculated avg_context_precision: 0.567 (from 10 values)
‚úÖ Calculated avg_context_recall: 0.360 (from 10 values)
‚úÖ Calculated avg_faithfulness: 0.676 (from 10 values)
‚úÖ Calculated avg_semantic_similarity: 





üéØ Evaluating model: minilm
üìä Using document aggregation (chunks‚Üídocs)
üîÑ Loading /content/drive/MyDrive/TesisMagister/acumulative/colab_data/docs_minilm_with_embeddings_20250721_125846.parquet...
‚úÖ 187,031 docs, 384 dims
üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
‚úÖ Dimension match: 384 == 384

üöÄ Starting REAL evaluation for 10 questions...


Real eval minilm:   0%|          | 0/10 [00:00<?, ?it/s]

üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
üß† Applied embedded CrossEncoder reranking for question 0
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.714
‚úÖ Extracted answer_relevancy (enhanced): 0.895
‚úÖ Extracted context_precision (enhanced): 0.833
‚úÖ Extracted context_recall (enhanced): 0.700
‚úÖ Extracted answer_correctness (enhanced): 0.696
‚úÖ Extracted semantic_similarity (enhanced): 0.873
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval minilm:  10%|‚ñà         | 1/10 [00:38<05:47, 38.60s/it]

‚úÖ BERTScore calculated with full content - P:0.887, R:0.819, F1:0.852
   üìè Content lengths - Generated: 376, Reference: 1449
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.887
   bert_recall: 0.819
   bert_f1: 0.852
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
üß† Applied embedded CrossEncoder reranking for question 1
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.500
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.250
‚úÖ Extracted answer_correctness (enhanced): 0.222
‚úÖ Extracted semantic_similarity (enhanced): 0.889
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval minilm:  20%|‚ñà‚ñà        | 2/10 [01:10<04:34, 34.37s/it]

‚úÖ BERTScore calculated with full content - P:0.891, R:0.803, F1:0.845
   üìè Content lengths - Generated: 216, Reference: 1732
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.891
   bert_recall: 0.803
   bert_f1: 0.845
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
üß† Applied embedded CrossEncoder reranking for question 2
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.000
‚úÖ Extracted answer_relevancy (enhanced): 0.843
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.959
‚úÖ Extracted semantic_similarity (enhanced): 0.837
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval minilm:  30%|‚ñà‚ñà‚ñà       | 3/10 [01:37<03:39, 31.34s/it]

‚úÖ BERTScore calculated with full content - P:0.832, R:0.796, F1:0.814
   üìè Content lengths - Generated: 496, Reference: 255
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.832
   bert_recall: 0.796
   bert_f1: 0.814
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
üß† Applied embedded CrossEncoder reranking for question 3
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.750
‚úÖ Extracted answer_relevancy (enhanced): 0.894
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.778
‚úÖ Extracted answer_correctness (enhanced): 0.455
‚úÖ Extracted semantic_similarity (enhanced): 0.919
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval minilm:  40%|‚ñà‚ñà‚ñà‚ñà      | 4/10 [02:18<03:29, 34.91s/it]

‚úÖ BERTScore calculated with full content - P:0.872, R:0.824, F1:0.847
   üìè Content lengths - Generated: 493, Reference: 1108
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.872
   bert_recall: 0.824
   bert_f1: 0.847
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
üß† Applied embedded CrossEncoder reranking for question 4
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.199
‚úÖ Extracted semantic_similarity (enhanced): 0.797
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval minilm:  50%|‚ñà‚ñà‚ñà‚ñà‚ñà     | 5/10 [02:49<02:47, 33.57s/it]

‚úÖ BERTScore calculated with full content - P:0.851, R:0.790, F1:0.820
   üìè Content lengths - Generated: 260, Reference: 705
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.851
   bert_recall: 0.790
   bert_f1: 0.820
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
üß† Applied embedded CrossEncoder reranking for question 5
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.333
‚úÖ Extracted answer_relevancy (enhanced): 0.843
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.528
‚úÖ Extracted semantic_similarity (enhanced): 0.778
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval minilm:  60%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà    | 6/10 [03:29<02:22, 35.72s/it]

‚úÖ BERTScore calculated with full content - P:0.825, R:0.754, F1:0.788
   üìè Content lengths - Generated: 494, Reference: 866
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.825
   bert_recall: 0.754
   bert_f1: 0.788
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
üß† Applied embedded CrossEncoder reranking for question 6
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.250
‚úÖ Extracted answer_correctness (enhanced): 0.218
‚úÖ Extracted semantic_similarity (enhanced): 0.873
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval minilm:  70%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà   | 7/10 [03:57<01:40, 33.42s/it]

‚úÖ BERTScore calculated with full content - P:0.858, R:0.780, F1:0.817
   üìè Content lengths - Generated: 144, Reference: 1149
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.858
   bert_recall: 0.780
   bert_f1: 0.817
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
üß† Applied embedded CrossEncoder reranking for question 7
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.583
‚úÖ Extracted context_recall (enhanced): 0.000
‚úÖ Extracted answer_correctness (enhanced): 0.194
‚úÖ Extracted semantic_similarity (enhanced): 0.777
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval minilm:  80%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà  | 8/10 [04:29<01:05, 32.81s/it]

‚úÖ BERTScore calculated with full content - P:0.832, R:0.806, F1:0.819
   üìè Content lengths - Generated: 389, Reference: 412
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.832
   bert_recall: 0.806
   bert_f1: 0.819
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
üß† Applied embedded CrossEncoder reranking for question 8
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 0.800
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 1.000
‚úÖ Extracted context_recall (enhanced): 0.375
‚úÖ Extracted answer_correctness (enhanced): 0.506
‚úÖ Extracted semantic_similarity (enhanced): 0.897
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval minilm:  90%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà | 9/10 [05:19<00:38, 38.08s/it]

‚úÖ BERTScore calculated with full content - P:0.862, R:0.816, F1:0.838
   üìè Content lengths - Generated: 1043, Reference: 2328
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.862
   bert_recall: 0.816
   bert_f1: 0.838
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üîÑ Loading sentence-transformers/all-MiniLM-L6-v2...
üß† Applied embedded CrossEncoder reranking for question 9
üìù Answer generated with 2000 chars/doc context (enhanced from 500)
üéØ RAGAS contexts prepared with 3000 chars/doc (enhanced from 1000)
üîÑ Evaluating with ENHANCED RAGAS (6 metrics)...


Evaluating:   0%|          | 0/6 [00:00<?, ?it/s]

üìä RAGAS returned columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall', 'answer_correctness', 'semantic_similarity']
üìã Unknown column (skipping): user_input
üìã Unknown column (skipping): retrieved_contexts
üìã Unknown column (skipping): response
üìã Unknown column (skipping): reference
‚úÖ Extracted faithfulness (enhanced): 1.000
‚úÖ Extracted answer_relevancy (enhanced): 0.000
‚úÖ Extracted context_precision (enhanced): 0.000
‚úÖ Extracted context_recall (enhanced): 0.500
‚úÖ Extracted answer_correctness (enhanced): 0.193
‚úÖ Extracted semantic_similarity (enhanced): 0.773
‚ö†Ô∏è Standard metric answer_similarity not available in results
üîÑ Calculating BERTScore with unlimited content...
üîÑ Calculating BERTScore with unlimited content...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Real eval minilm: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 10/10 [05:46<00:00, 34.64s/it]

‚úÖ BERTScore calculated with full content - P:0.862, R:0.813, F1:0.837
   üìè Content lengths - Generated: 143, Reference: 623
‚úÖ BERTScore added with unlimited content:
   bert_precision: 0.862
   bert_recall: 0.813
   bert_f1: 0.837
‚úÖ ENHANCED evaluation completed: 6/6 RAGAS metrics + BERTScore
üìä Found 9 RAG metric types: ['answer_correctness', 'answer_relevancy', 'bert_f1', 'bert_precision', 'bert_recall', 'context_precision', 'context_recall', 'faithfulness', 'semantic_similarity']
‚úÖ Calculated avg_answer_correctness: 0.417 (from 10 values)
‚úÖ Calculated avg_answer_relevancy: 0.348 (from 10 values)
‚úÖ Calculated avg_bert_f1: 0.828 (from 10 values)
‚úÖ Calculated avg_bert_precision: 0.857 (from 10 values)
‚úÖ Calculated avg_bert_recall: 0.800 (from 10 values)
‚úÖ Calculated avg_context_precision: 0.542 (from 10 values)
‚úÖ Calculated avg_context_recall: 0.285 (from 10 values)
‚úÖ Calculated avg_faithfulness: 0.710 (from 10 values)
‚úÖ Calculated avg_semantic_similarity: 





üéâ REAL evaluation completed!
üìä Models evaluated: ['ada', 'e5-large', 'mpnet', 'minilm']
üîÑ Reranking method used: crossencoder
‚è±Ô∏è Evaluation time: 1466.85 seconds

üíæ Saving REAL results in EXACT original format...
üíæ Processing REAL results in EXACT original format...
üíæ REAL results saved successfully!
üìÇ File: cumulative_results_1753517578.json
‚è∞ Time: 2025-07-26 04:12:58 -04
üìä Size: 0.3 MB
üéØ Models: 4 evaluated
üîÑ Reranking: crossencoder
‚úÖ ALL METRICS ARE REAL - NO SIMULATION USED

üíæ Archivos guardados:
  üìÑ JSON: /content/drive/MyDrive/TesisMagister/acumulative/cumulative_results_1753517578.json
  ‚è∞ Timestamp: 1753517578
  üåç Time: 2025-07-26 04:12:58 -04
  ‚úÖ Format verified: True
  ‚úÖ REAL data verified: True

üî¨ VERIFICACI√ìN CIENT√çFICA:
‚úÖ Todos los valores de m√©tricas son REALES
‚úÖ NO se usaron valores aleatorios o simulados
‚úÖ Retrieval basado en similitud coseno real
‚úÖ RAG evaluation con RAGAS framework real
‚úÖ Reranking

## üìà 6. Visualizaci√≥n de Resultados

In [52]:
# Display results using STANDARD metric names from RAGAS and BERTScore
if saved_files and 'json' in saved_files:
    # Load results to display summary
    with open(saved_files['json'], 'r') as f:
        final_results = json.load(f)

    print("üìä Resumen de Resultados (STANDARD RAGAS + BERTScore Names)")
    print("="*70)

    # Show structure verification
    print("üîç Estructura JSON verificada:")
    print(f"  ‚úÖ config: {len(final_results.get('config', {})) > 0}")
    print(f"  ‚úÖ evaluation_info: {len(final_results.get('evaluation_info', {})) > 0}")
    print(f"  ‚úÖ results: {len(final_results.get('results', {})) > 0}")

    # Show models and their metrics
    if 'results' in final_results:
        results_data = final_results['results']
        print(f"\nüéØ Modelos evaluados: {len(results_data)}")

        for model_name, model_data in results_data.items():
            print(f"\nüìä {model_name.upper()}:")
            print(f"  üìù Questions: {model_data.get('num_questions_evaluated', 0)}")
            print(f"  üìè Dimensions: {model_data.get('embedding_dimensions', 0)}")
            print(f"  üìÑ Documents: {model_data.get('total_documents', 0):,}")

            # Show key retrieval metrics
            before_metrics = model_data.get('avg_before_metrics', {})
            if before_metrics:
                print(f"  üìà P@5: {before_metrics.get('precision@5', 0):.3f}")
                print(f"  ‚ö° MRR: {before_metrics.get('mrr', 0):.3f}")
                print(f"  üéØ NDCG@5: {before_metrics.get('ndcg@5', 0):.3f}")

            # Show RAG metrics using STANDARD names (no avg_ prefix needed here)
            rag_metrics = model_data.get('rag_metrics', {})
            if rag_metrics.get('rag_available'):
                print(f"  ü§ñ RAG + BERTScore Metrics (Standard Names):")

                # STANDARD RAGAS metrics (with avg_ prefix for storage, standard names for display)
                standard_ragas_metrics = [
                    ('avg_faithfulness', 'Faithfulness'),
                    ('avg_answer_relevancy', 'Answer Relevancy'),  # Standard RAGAS name
                    ('avg_context_precision', 'Context Precision'),
                    ('avg_context_recall', 'Context Recall'),
                    ('avg_answer_correctness', 'Answer Correctness'),
                    ('avg_answer_similarity', 'Answer Similarity'),
                    ('avg_semantic_similarity', 'Semantic Similarity'),  # Alternative name
                ]

                ragas_found = False
                for metric_key, metric_label in standard_ragas_metrics:
                    if metric_key in rag_metrics:
                        print(f"    üìã {metric_label}: {rag_metrics[metric_key]:.3f}")
                        ragas_found = True

                if not ragas_found:
                    print(f"    ‚ö†Ô∏è RAGAS metrics: No disponible")

                # STANDARD BERTScore metrics (with avg_ prefix for storage, standard names for display)
                standard_bertscore_metrics = [
                    ('avg_bert_precision', 'BERT Precision'),
                    ('avg_bert_recall', 'BERT Recall'),
                    ('avg_bert_f1', 'BERT F1')
                ]

                bertscore_found = False
                for metric_key, metric_label in standard_bertscore_metrics:
                    if metric_key in rag_metrics:
                        print(f"    üéØ {metric_label}: {rag_metrics[metric_key]:.3f}")
                        bertscore_found = True

                if not bertscore_found:
                    print(f"    ‚ö†Ô∏è BERTScore: No disponible (paquete bert-score no instalado)")

                print(f"    üìä Evaluaciones: {rag_metrics.get('successful_evaluations', 0)}/{rag_metrics.get('total_evaluations', 0)} exitosas")

        # Find best model
        best_model = None
        best_p5 = 0
        for model_name, model_data in results_data.items():
            p5 = model_data.get('avg_before_metrics', {}).get('precision@5', 0)
            if p5 > best_p5:
                best_p5 = p5
                best_model = model_name

        if best_model:
            print(f"\nüèÜ Mejor modelo: {best_model} (P@5: {best_p5:.3f})")

    # Show file info
    config_info = final_results.get('config', {})
    eval_info = final_results.get('evaluation_info', {})

    print(f"\nüìÑ Informaci√≥n del archivo:")
    print(f"  üìÇ Nombre: cumulative_results_{saved_files.get('timestamp', 'unknown')}.json")
    print(f"  ‚è∞ Timestamp: {eval_info.get('timestamp', 'N/A')}")
    print(f"  üåç Timezone: {eval_info.get('timezone', 'N/A')}")
    print(f"  üìä Tipo: {eval_info.get('evaluation_type', 'N/A')}")
    print(f"  ‚úÖ Compatible Streamlit: {eval_info.get('enhanced_display_compatible', False)}")

    # Show data verification
    data_verification = eval_info.get('data_verification', {})
    if data_verification:
        print(f"\nüî¨ Verificaci√≥n de datos:")
        print(f"  ‚úÖ Datos reales: {data_verification.get('is_real_data', False)}")
        print(f"  ‚úÖ Sin simulaci√≥n: {data_verification.get('no_simulation', False)}")
        print(f"  ‚úÖ Sin valores aleatorios: {data_verification.get('no_random_values', False)}")
        print(f"  üìä Framework RAG: {data_verification.get('rag_framework', 'N/A')}")

else:
    print("‚ùå No se pudieron cargar los resultados para mostrar")

print("\n" + "="*70)
print("üéâ EVALUACI√ìN COMPLETADA CON NOMBRES EST√ÅNDAR")
print("üìä Archivo compatible con Streamlit usando nombres est√°ndar de bibliotecas")
print("üîÑ Compatible con aplicaci√≥n existente")
print("üéØ Incluye m√©tricas RAGAS (nombres est√°ndar) + BERTScore (nombres est√°ndar)")

üìä Resumen de Resultados (STANDARD RAGAS + BERTScore Names)
üîç Estructura JSON verificada:
  ‚úÖ config: True
  ‚úÖ evaluation_info: True
  ‚úÖ results: True

üéØ Modelos evaluados: 4

üìä ADA:
  üìù Questions: 10
  üìè Dimensions: 1536
  üìÑ Documents: 187,031
  üìà P@5: 0.040
  ‚ö° MRR: 0.079
  üéØ NDCG@5: 0.100
  ü§ñ RAG + BERTScore Metrics (Standard Names):
    üìã Faithfulness: 0.405
    üìã Answer Relevancy: 0.624
    üìã Context Precision: 0.692
    üìã Context Recall: 0.478
    üìã Answer Correctness: 0.468
    üìã Semantic Similarity: 0.848
    üéØ BERT Precision: 0.860
    üéØ BERT Recall: 0.800
    üéØ BERT F1: 0.829
    üìä Evaluaciones: 10/10 exitosas

üìä E5-LARGE:
  üìù Questions: 10
  üìè Dimensions: 1024
  üìÑ Documents: 187,031
  üìà P@5: 0.020
  ‚ö° MRR: 0.025
  üéØ NDCG@5: 0.043
  ü§ñ RAG + BERTScore Metrics (Standard Names):
    üìã Faithfulness: 0.703
    üìã Answer Relevancy: 0.436
    üìã Context Precision: 0.583
    üìã Context 

## üßπ 7. Limpieza y Finalizaci√≥n

In [53]:
# Limpiar recursos y memoria
print("üßπ Limpiando recursos...")

# Limpiar pipeline de datos
data_pipeline.cleanup()

# Limpiar memoria
gc.collect()

# Mostrar resumen final
end_time = time.time()
total_time = end_time - setup_result.get('start_time', end_time)

print("\n" + "="*60)
print("üéâ EVALUACI√ìN COMPLETADA EXITOSAMENTE")
print("="*60)
print(f"‚è±Ô∏è Tiempo total de ejecuci√≥n: {total_time/60:.2f} minutos")
print(f"üìä Modelos evaluados: {len(available_models)}")
print(f"‚ùì Preguntas por modelo: {MAX_QUESTIONS or 'Todas'}")
print(f"ü§ñ LLM Reranking usado: {'‚úÖ' if USE_LLM_RERANKING else '‚ùå'}")

print("\nüìÅ Archivo generado:")
if saved_files and 'json' in saved_files:
    print(f"  üìÑ JSON: {saved_files['json']}")
    print(f"  üéØ Formato: EXACTO compatible con original")
    print(f"  üìä Estructura: config + evaluation_info + results")
    print(f"  ‚úÖ RAG metrics: Con prefijo avg_ para Streamlit")
    print(f"  üåç Timezone: Chile ({saved_files.get('chile_time', 'N/A')})")
else:
    print("  ‚ùå Error al generar archivo")

print("\nüîß VERIFICACI√ìN FINAL:")
print("‚úÖ Nombre archivo: cumulative_results_xxxxx.json ‚úì")
print("‚úÖ Estructura JSON: Id√©ntica al original ‚úì")
print("‚úÖ M√©tricas RAG: Con prefijo avg_ ‚úì")
print("‚úÖ Compatible Streamlit: Sin modificaciones ‚úì")
print("‚úÖ Funcionalidad: Id√©ntica al Colab original ‚úì")

print("\n‚ú® ¬°Listo para usar en aplicaciones de producci√≥n!")
print("üéØ No se agregaron funcionalidades adicionales")
print("üìä Formato 100% compatible con Streamlit existente")

üßπ Limpiando recursos...

üéâ EVALUACI√ìN COMPLETADA EXITOSAMENTE
‚è±Ô∏è Tiempo total de ejecuci√≥n: 24.55 minutos
üìä Modelos evaluados: 4
‚ùì Preguntas por modelo: 10
ü§ñ LLM Reranking usado: ‚úÖ

üìÅ Archivo generado:
  üìÑ JSON: /content/drive/MyDrive/TesisMagister/acumulative/cumulative_results_1753517578.json
  üéØ Formato: EXACTO compatible con original
  üìä Estructura: config + evaluation_info + results
  ‚úÖ RAG metrics: Con prefijo avg_ para Streamlit
  üåç Timezone: Chile (2025-07-26 04:12:58 -04)

üîß VERIFICACI√ìN FINAL:
‚úÖ Nombre archivo: cumulative_results_xxxxx.json ‚úì
‚úÖ Estructura JSON: Id√©ntica al original ‚úì
‚úÖ M√©tricas RAG: Con prefijo avg_ ‚úì
‚úÖ Compatible Streamlit: Sin modificaciones ‚úì
‚úÖ Funcionalidad: Id√©ntica al Colab original ‚úì

‚ú® ¬°Listo para usar en aplicaciones de producci√≥n!
üéØ No se agregaron funcionalidades adicionales
üìä Formato 100% compatible con Streamlit existente


---

## üìö Uso de las Bibliotecas Modulares

Este notebook utiliza las siguientes bibliotecas modulares:

### üîß `colab_setup.py`
- Manejo de instalaci√≥n de paquetes
- Autenticaci√≥n con APIs
- Configuraci√≥n del entorno

### üìä `evaluation_metrics.py`
- C√°lculo de m√©tricas de retrieval (Precision, Recall, F1, NDCG, MAP, MRR)
- Comparaci√≥n de rendimiento
- Estad√≠sticas detalladas

### ü§ñ `rag_evaluation.py`
- Integraci√≥n con RAGAS framework
- LLM reranking con OpenAI
- BERTScore para similitud sem√°ntica

### üíæ `data_manager.py`
- Carga de documentos con embeddings
- Generaci√≥n de embeddings de consultas
- Retrieval por similitud coseno

### üìà `results_processor.py`
- Procesamiento de resultados
- An√°lisis de rendimiento
- Exportaci√≥n a m√∫ltiples formatos

---

## üîÑ Pr√≥ximos Pasos

1. **Integraci√≥n con Streamlit**: Los resultados pueden importarse directamente
2. **Personalizaci√≥n**: Modificar par√°metros en las bibliotecas seg√∫n necesidades
3. **Extensi√≥n**: Agregar nuevos modelos o m√©tricas f√°cilmente
4. **Producci√≥n**: Usar las bibliotecas en aplicaciones reales

---

*Generado con arquitectura modular para m√°xima reutilizaci√≥n y mantenibilidad*

In [54]:
# üîî Sound Alert - Beep notification
print("üîî Playing beep sound notification...")

try:
    # Try different methods to play beep sound

    # Method 1: IPython Audio (most reliable in Colab)
    try:
        from IPython.display import Audio, display
        import numpy as np

        # Generate a simple beep tone
        sample_rate = 22050
        duration = 0.5  # seconds
        frequency = 800  # Hz

        # Create sine wave
        t = np.linspace(0, duration, int(sample_rate * duration))
        beep_wave = 0.3 * np.sin(frequency * 2 * np.pi * t)

        # Display audio
        audio = Audio(beep_wave, rate=sample_rate, autoplay=True)
        display(audio)

        print("‚úÖ Beep sound played using IPython Audio")

    except ImportError:
        # Method 2: HTML5 Audio (fallback)
        from IPython.display import HTML, display

        html_audio = """
        <audio autoplay>
            <source src="data:audio/wav;base64,UklGRnoGAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQoGAACBhYqFbF1fdJivrJBhNjVgodDbq2EcBj+a2/LDciUFLIHO8tiJNwgZaLvt559NEAxQp+PwtmMcBjiR1/LMeSsFJHfH8N2QQAoUXrTp66hVFApGn+DyvmEfBkCZ3/PLdCQNI4vM9t2QQAw" type="audio/wav">
        </audio>
        """

        display(HTML(html_audio))
        print("‚úÖ Beep sound played using HTML5 Audio")

except Exception as e:
    # Method 3: Console beep (final fallback)
    try:
        import os
        import sys

        if sys.platform == "win32":
            import winsound
            winsound.Beep(800, 500)
            print("‚úÖ Beep sound played using Windows Beep")
        else:
            # Unix/Linux/Mac
            os.system('echo -e "\a"')
            print("‚úÖ Beep sound played using system bell")

    except Exception as e2:
        print(f"‚ö†Ô∏è Could not play beep sound: {e2}")
        print("üîî NOTIFICATION: Cell execution completed!")

print("üéâ Cell execution finished - notification sent!")

üîî Playing beep sound notification...


‚úÖ Beep sound played using IPython Audio
üéâ Cell execution finished - notification sent!
