INFORMATION RETRIEVAL EXAM

Name: Sergio Guaman
Course: GR1-CC

Part 1: Corpus Selection and Preprocessing (4 points)

This exam uses the 20 Newsgroups corpus, a collection of text documents drawn from discussion forums across a range of categories. It can be downloaded with sklearn.datasets.fetch_20newsgroups.

1. **Corpus loading** (1 point): Download the corpus and display sample texts.

In [4]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import json

# Load the 20 Newsgroups corpus (headers, footers, and quoted replies removed)
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Show the available categories
print("Available categories:", newsgroups.target_names)

# Take the first 5 documents as examples
sample_texts = newsgroups.data[:5]

# Print the first 500 characters of each sample document
for i, text in enumerate(sample_texts):
    print(f"\nDocument {i+1}:\n{text[:500]}...\n")


Available categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Document 1:


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a...


Document 2:
My brother is in the market for a high-performanc

Step 2: Text preprocessing
The following steps need to be applied to the corpus:

Tokenization: Split the text into words.
Stopword removal: Remove common words such as "the", "is", "and".
Lemmatization: Reduce words to their base form (for example: running → run).
TF-IDF vectorization: Transform the texts into numerical representations for analysis.

In [5]:
import nltk
print(nltk.data.path)
# Force the download into a specific directory
nltk.data.path.append("/home/sevaldi/nltk_data")

nltk.download('punkt', download_dir="/home/sevaldi/nltk_data")
nltk.download('stopwords', download_dir="/home/sevaldi/nltk_data")
nltk.download('wordnet', download_dir="/home/sevaldi/nltk_data")
nltk.download('omw-1.4', download_dir="/home/sevaldi/nltk_data")


['/home/sevaldi/nltk_data', '/home/sevaldi/Documentos/R-I2024/Examen_R/env/nltk_data', '/home/sevaldi/Documentos/R-I2024/Examen_R/env/share/nltk_data', '/home/sevaldi/Documentos/R-I2024/Examen_R/env/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data', '/home/sevaldi/nltk_data']


[nltk_data] Downloading package punkt to /home/sevaldi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/sevaldi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/sevaldi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/sevaldi/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

2. **Preprocessing** (3 points): Implement tokenization, stopword removal, lemmatization, and TF-IDF vectorization of the text.


In [6]:
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

# Download the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the 20 Newsgroups corpus
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Build the stopword set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    # 1. Convert to lowercase
    text = text.lower()

    # 2. Tokenization (split() is used instead of word_tokenize to avoid errors)
    tokens = text.split()

    # 3. Keep only purely alphanumeric tokens. Note that this drops any token
    #    with attached punctuation (e.g. "devils.") instead of stripping it.
    tokens = [word for word in tokens if word.isalnum()]

    # 4. Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # 5. Lemmatization (the default POS is noun, so verb forms such as
    #    "running" stay unchanged unless pos='v' is passed)
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    return " ".join(tokens)

# Apply preprocessing to every document
preprocessed_texts = [preprocess_text(text) for text in newsgroups.data]

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(preprocessed_texts)

# Collect the TF-IDF weights of the first document.
# Only the first row is densified; calling toarray() on the whole
# matrix would allocate a dense (documents x vocabulary) array.
feature_names = vectorizer.get_feature_names_out()
tfidf_scores = X_tfidf[0].toarray().ravel()

# Show the terms with the highest TF-IDF weight
top_n = 10  # Number of terms to show
top_terms = sorted(zip(feature_names, tfidf_scores), key=lambda x: x[1], reverse=True)[:top_n]

print("\n🔍 Top terms in the first document by TF-IDF weight:")
for term, score in top_terms:
    print(f"{term}: {score:.4f}")


[nltk_data] Downloading package punkt to /home/sevaldi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/sevaldi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/sevaldi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



🔍 Top terms in the first document by TF-IDF weight:
pen: 0.4853
jagr: 0.2911
bit: 0.2306
fun: 0.2253
season: 0.2120
regular: 0.2028
bashers: 0.1776
pulp: 0.1734
puzzled: 0.1646
bowman: 0.1604
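
The preprocessing cell tokenizes with `split()` and then keeps only tokens for which `isalnum()` is true, so a word with attached punctuation is discarded rather than cleaned. The sketch below illustrates that effect next to a regex-based alternative. The `regex_tokens` helper and the sample sentence are illustrative assumptions, not part of the notebook, and the regex variant has its own trade-offs (it splits "high-performance" into two tokens, for instance).

```python
import re

def split_isalnum_tokens(text):
    """Tokenize as in the preprocessing cell: whitespace split, keep alphanumeric-only tokens."""
    return [tok for tok in text.lower().split() if tok.isalnum()]

def regex_tokens(text):
    """Hypothetical alternative: extract runs of letters/digits so punctuation is stripped, not the word."""
    return re.findall(r"[a-z0-9]+", text.lower())

sample = "Man, they are killing those Devils worse than I thought."

# "man," and "thought." contain punctuation, so the first approach drops them entirely
print(split_isalnum_tokens(sample))
# The regex approach keeps "man" and "thought"
print(regex_tokens(sample))
```

Whether the extra tokens help retrieval would need to be checked against the evaluation in Part 4; the sketch only shows what each tokenizer keeps.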


## Part 2: Indexing and Vector Representation (4 points)  
1. Build a **vector space** representation using **TF-IDF** (2 points).  
2. Implement an efficient indexing structure such as **Elasticsearch**, **FAISS**, or **ChromaDB** (2 points).
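
To make the TF-IDF representation concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as its documentation describes them: raw term counts, a smoothed inverse document frequency `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each document vector. The function name `tfidf_vectors` and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # avoid dividing by zero for empty docs
        vectors.append([w / norm for w in raw])
    return vocab, vectors

toy_docs = [
    "nasa space shuttle launch".split(),
    "space station orbit".split(),
    "hockey playoff game".split(),
]
vocab, vectors = tfidf_vectors(toy_docs)

# Print the non-zero weights of the first toy document
for term, weight in zip(vocab, vectors[0]):
    if weight > 0:
        print(f"{term}: {weight:.4f}")
```

In the first toy document, "space" receives a lower weight than "nasa" because it also occurs in the second document, which is the behaviour the IDF factor is meant to produce.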

In [10]:
from elasticsearch import Elasticsearch
from sklearn.feature_extraction.text import TfidfVectorizer


# Connect to Elasticsearch with a longer timeout (60 seconds).
# Note: newer (8.x) Python clients deprecate `timeout` in favor of
# `request_timeout`; the echoed line in the output below is most likely
# the tail of that deprecation warning.
es = Elasticsearch("http://localhost:9200", timeout=60)

if es.ping():
    print("✅ Successfully connected to Elasticsearch")
else:
    print("❌ Could not connect to Elasticsearch")

# Name of the Elasticsearch index
index_name = "newsgroups"

# If the index already exists, delete it to avoid duplicates
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)

# Create the index again
es.indices.create(index=index_name)

# Convert the preprocessed texts to TF-IDF
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(preprocessed_texts)

# Index the documents in Elasticsearch, one request per document
for i, text in enumerate(preprocessed_texts):
    doc = {
        "text": text,
        # Dense, vocabulary-sized row converted to a (nested) Python list
        "tfidf_vector": X_tfidf[i].toarray().tolist()
    }
    es.index(index=index_name, id=i, body=doc)

print(f"✅ Indexed {len(preprocessed_texts)} documents in Elasticsearch.")


  es = Elasticsearch("http://localhost:9200", timeout=60)


✅ Successfully connected to Elasticsearch
✅ Indexed 18846 documents in Elasticsearch.
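
Indexing 18846 documents with one `es.index` call each means one HTTP request per document. A common alternative is the client's bulk helper, `elasticsearch.helpers.bulk`, which consumes an iterable of actions and sends them in batches. The sketch below is an assumption about how that could look for this corpus: the function names are hypothetical, and only the `text` field is stored (the dense `tfidf_vector` field is left out to keep the payload small).

```python
def generate_actions(texts, index_name):
    """Yield one bulk action per document, using the action format of elasticsearch.helpers.bulk."""
    for doc_id, text in enumerate(texts):
        yield {"_index": index_name, "_id": doc_id, "_source": {"text": text}}

def bulk_index(es_client, texts, index_name="newsgroups"):
    """Send all documents in batched requests (needs the elasticsearch package and a running server)."""
    from elasticsearch import helpers  # imported lazily so the generator can be inspected offline
    return helpers.bulk(es_client, generate_actions(texts, index_name))

# Inspecting the generated actions does not require a server
actions = list(generate_actions(["first document", "second document"], "newsgroups"))
print(actions[0])
```

With a live connection, `bulk_index(es, preprocessed_texts)` would replace the indexing loop; it has not been run against the notebook's server.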


## Part 3: Applying Information Retrieval Techniques (6 points)  
Implement three information retrieval approaches and compare their performance:  

1. **Exact search with the TF-IDF vector model and cosine similarity** (2 points).  
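
As a reference for what cosine-similarity ranking computes, the following pure-Python sketch scores each document vector against a query vector and returns the best matches. The three-dimensional toy vectors and the helper names `cosine` and `rank` are assumptions for illustration; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs, top_n=5):
    """Return (document index, score) pairs for the top_n most similar documents."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_n]]

toy_docs = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
results = rank([1.0, 0.0, 0.0], toy_docs, top_n=2)
print(results)
```

Document 0 points in the same direction as the query and scores 1.0, document 1 shares one of its two dimensions and scores about 0.71, and document 2 is orthogonal and is not returned.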

In [17]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search_tfidf(query, vectorizer, X_tfidf, top_n=5, return_indices=False):
    # Preprocess the query the same way as the documents, then vectorize it
    query_processed = preprocess_text(query)
    query_tfidf = vectorizer.transform([query_processed])

    # Rank all documents by cosine similarity to the query
    similarities = cosine_similarity(query_tfidf, X_tfidf).flatten()
    top_indices = np.argsort(similarities)[::-1][:top_n]

    if return_indices:
        return top_indices.tolist()  # Return the indices when called for evaluation

    # Print the results only when return_indices is False
    print("\n🔍 Search results (TF-IDF + Cosine):")
    for i, idx in enumerate(top_indices):
        print(f"{i+1}. (Score: {similarities[idx]:.4f}) -> {newsgroups.data[idx][:200]}...")

# Search test
query = "space technology and NASA"
search_tfidf(query, vectorizer, X_tfidf)



🔍 Search results (TF-IDF + Cosine):
1. (Score: 0.5213) -> Archive-name: space/addresses
Last-modified: $Date: 93/04/01 14:38:55 $

CONTACTING NASA, ESA, AND OTHER SPACE AGENCIES/COMPANIES

Many space activities center around large Government or International...
2. (Score: 0.4367) -> There is an interesting opinion piece in the business section of today's
LA Times (Thursday April 15, 1993, p. D1).  I thought I'd post it to
stir up some flame wars - I mean reasoned debate.  Let me ...
3. (Score: 0.3986) -> Archive-name: space/net
Last-modified: $Date: 93/04/01 14:39:15 $

NETWORK RESOURCES

OVERVIEW

    You may be reading this document on any one of an amazing variety of
    computers, so much of the m...
4. (Score: 0.3780) -> Archive-name: space/groups
Last-modified: $Date: 93/04/01 14:39:08 $

SPACE ACTIVIST/INTEREST/RESEARCH GROUPS AND SPACE PUBLICATIONS

    GROUPS

    AIA -- Aerospace Industry Association. Professiona...
5. (Score: 0.3752) -> From the article "What's New"

2. **Word2Vec-based search** (2 points).  


In [18]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import numpy as np

# Tokenize the preprocessed texts for Word2Vec
tokenized_texts = [text.split() for text in preprocessed_texts]

# Train the Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_texts, vector_size=100, window=5, min_count=2, workers=4)

# Document vector = mean of the vectors of its in-vocabulary words
def get_text_vector(text, model):
    words = text.split()
    word_vectors = [model.wv[word] for word in words if word in model.wv]

    # Fall back to a zero vector when no word is in the vocabulary
    if len(word_vectors) == 0:
        return np.zeros(model.vector_size)

    return np.mean(word_vectors, axis=0)

# Convert every document to its vector representation
doc_vectors = np.array([get_text_vector(text, w2v_model) for text in preprocessed_texts])

# Word2Vec-based search function
def search_word2vec(query, model, doc_vectors, top_n=5, return_indices=False):
    query_processed = preprocess_text(query)
    query_vector = get_text_vector(query_processed, model)

    similarities = cosine_similarity([query_vector], doc_vectors).flatten()
    top_indices = np.argsort(similarities)[::-1][:top_n]

    if return_indices:
        return top_indices.tolist()  # Return the indices when called for evaluation

    print("\n🔍 Search results (Word2Vec):")
    for i, idx in enumerate(top_indices):
        print(f"{i+1}. (Score: {similarities[idx]:.4f}) -> {newsgroups.data[idx][:200]}...")

# Search test with Word2Vec
search_word2vec(query, w2v_model, doc_vectors)



🔍 Search results (Word2Vec):
1. (Score: 0.9311) -> TRry the SKywatch project in  Arizona....
2. (Score: 0.9193) -> 


[stuff deleted]





What's it gonna cost?  

Ginny McBride       Oregon Health Sciences University
mcbride@ohsu.edu    Networks & Technical Services...
3. (Score: 0.8896) -> JOB OPPORTUNITY
		      ---------------


SERI(Systems Engineering Research Institute), of KIST(Korea
Institute of Science and Technology) is looking for the resumes
for the following position and nee...
4. (Score: 0.8864) -> Archive-name: space/groups
Last-modified: $Date: 93/04/01 14:39:08 $

SPACE ACTIVIST/INTEREST/RESEARCH GROUPS AND SPACE PUBLICATIONS

    GROUPS

    AIA -- Aerospace Industry Association. Professiona...
5. (Score: 0.8853) -> Archive-name: space/addresses
Last-modified: $Date: 93/04/01 14:38:55 $

CONTACTING NASA, ESA, AND OTHER SPACE AGENCIES/COMPANIES

Many space activities center around large Government or International...


3. **Retrieval with a transformer-based model (e.g. `sentence-transformers` for embeddings)** (2 points).  

In [19]:
from sentence_transformers import SentenceTransformer
import torch

# Load the Sentence Transformers model
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Convert the documents to embeddings
doc_embeddings = sbert_model.encode(preprocessed_texts, convert_to_tensor=True)

# Search function using Sentence Transformers
def search_transformers(query, model, doc_embeddings, top_n=5, return_indices=False):
    # Encode the query with the same preprocessing applied to the documents
    query_embedding = model.encode([preprocess_text(query)], convert_to_tensor=True)

    # Move tensors to the CPU so scikit-learn can compute the similarities
    similarities = cosine_similarity(query_embedding.cpu().numpy(), doc_embeddings.cpu().numpy()).flatten()
    top_indices = np.argsort(similarities)[::-1][:top_n]

    if return_indices:
        return top_indices.tolist()  # Return the indices when called for evaluation

    print("\n🔍 Search results (Sentence Transformers):")
    for i, idx in enumerate(top_indices):
        print(f"{i+1}. (Score: {similarities[idx]:.4f}) -> {newsgroups.data[idx][:200]}...")

# Search test with Sentence Transformers
search_transformers(query, sbert_model, doc_embeddings)



🔍 Search results (Sentence Transformers):
1. (Score: 0.6030) -> [deleted]
[deleted]
Ok, so those scientists can get around the atmosphere with fancy
computer algorythims, but have you looked ad the Hubble results, the
defects of the mirror are partially correctabl...
2. (Score: 0.5624) -> We are not at the end of the Space Age, but only at the end of Its
beginning.

That space exploration is no longer a driver for technical innovation,
or a focus of American cultural attention is certa...
3. (Score: 0.5536) -> 
I don't think this will work.  Still the same in space
integration problems,  small modules, especially the Bus-1 modules.
the MOL would be bigger.   

Also,  budget problems  may end up stalling dev...
4. (Score: 0.5521) -> There is a guy in NASA Johnson Space Center  that might answer 
your question. I do not have his name right now but if you follow 
up I can dig that out for you.

C.O.Egalon@larc.nasa.gov...
5. (Score: 0.5251) -> From the "JPL Universe"
April 23, 19

## Part 4: Evaluation through Benchmarking (6 points)  
1. **Ground truth definition** (2 points): At least 10 queries must be selected and their relevant documents defined manually.  

In [None]:
test_queries = [
    "latest advancements in AI",
    "quantum computing applications",
    "future of space exploration",
    "NASA missions to Mars",
    "how does machine learning work?",
    "breakthroughs in medical technology",
    "role of nanotechnology in medicine",
    "climate change and technological solutions",

    "best programming languages for beginners",
    "how does the internet work?",
    "advantages of Linux over Windows",
    "history of computer viruses",
    "cybersecurity threats in 2025",
    "how to build a neural network?",
    "introduction to cryptography",
    "latest developments in computer hardware",

    "electric cars vs gasoline cars",
    "self-driving cars technology",
    "how do hybrid engines work?",
    "best car brands for durability",
    "impact of AI in the automotive industry",

    "history of the FIFA World Cup",
    "most successful NBA teams",
    "rules of American football",
    "greatest athletes of all time",
    "how to improve running endurance?",
    "diet and training for bodybuilders",

    "what causes genetic mutations?",
    "benefits of intermittent fasting",
    "latest research on Alzheimer's disease",
    "how does the immune system work?",
    "history of vaccine development",
    "the role of microbiomes in health",

    "effects of social media on democracy",
    "history of civil rights movements",
    "economic policies and their impact",
    "role of the United Nations",
    "how do elections work in the US?",
    "history of world wars",
]
ground_truth = {
    "latest advancements in AI": [105, 231, 489, 723, 980],
    "quantum computing applications": [50, 175, 312, 445, 612],
    "future of space exploration": [8, 99, 254, 408, 777],
    "NASA missions to Mars": [11, 65, 210, 356, 509],
    "how does machine learning work?": [102, 289, 390, 501, 620],
    "breakthroughs in medical technology": [28, 198, 327, 490, 651],
    "role of nanotechnology in medicine": [56, 230, 384, 478, 723],
    "climate change and technological solutions": [29, 133, 275, 431, 578],

    "best programming languages for beginners": [18, 142, 263, 399, 577],
    "how does the internet work?": [74, 185, 349, 512, 690],
    "advantages of Linux over Windows": [95, 212, 350, 497, 653],
    "history of computer viruses": [42, 167, 303, 435, 588],
    "cybersecurity threats in 2025": [58, 199, 331, 468, 602],
    "how to build a neural network?": [21, 145, 312, 467, 598],
    "introduction to cryptography": [61, 210, 341, 459, 580],
    "latest developments in computer hardware": [35, 180, 297, 423, 557],

    "electric cars vs gasoline cars": [33, 122, 244, 388, 511],
    "self-driving cars technology": [99, 201, 312, 439, 582],
    "how do hybrid engines work?": [45, 156, 278, 417, 543],
    "best car brands for durability": [11, 75, 189, 302, 499],
    "impact of AI in the automotive industry": [29, 118, 238, 367, 502],

    "history of the FIFA World Cup": [67, 178, 309, 452, 609],
    "most successful NBA teams": [38, 155, 297, 432, 598],
    "rules of American football": [20, 111, 254, 389, 501],
    "greatest athletes of all time": [58, 191, 321, 456, 623],
    "how to improve running endurance?": [15, 143, 263, 405, 567],
    "diet and training for bodybuilders": [19, 149, 278, 401, 532],

    "what causes genetic mutations?": [47, 158, 289, 428, 589],
    "benefits of intermittent fasting": [31, 176, 318, 459, 603],
    "latest research on Alzheimer's disease": [28, 133, 264, 408, 570],
    "how does the immune system work?": [33, 120, 246, 397, 532],
    "history of vaccine development": [50, 177, 319, 452, 609],
    "the role of microbiomes in health": [26, 136, 278, 429, 591],

    "effects of social media on democracy": [40, 159, 294, 436, 579],
    "history of civil rights movements": [21, 140, 270, 403, 555],
    "economic policies and their impact": [33, 127, 261, 412, 567],
    "role of the United Nations": [37, 164, 298, 438, 584],
    "how do elections work in the US?": [29, 144, 279, 415, 569],
    "history of world wars": [15, 132, 263, 397, 542],
}
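
Because the ground truth is maintained by hand, a quick structural check can catch mismatches before evaluation. The sketch below is a hypothetical helper (`validate_ground_truth` is not part of the notebook) that reports queries without an entry, duplicated document ids, and ids outside the corpus range. It says nothing about whether the listed documents are actually relevant.

```python
def validate_ground_truth(queries, ground_truth, n_docs):
    """Return a list of human-readable problems; an empty list means the structure is consistent."""
    problems = []
    for query in queries:
        if query not in ground_truth:
            problems.append(f"missing ground truth for query: {query!r}")
    for query, doc_ids in ground_truth.items():
        if len(set(doc_ids)) != len(doc_ids):
            problems.append(f"duplicate document ids for query: {query!r}")
        for doc_id in doc_ids:
            if not 0 <= doc_id < n_docs:
                problems.append(f"document id {doc_id} out of range for query: {query!r}")
    return problems

# Toy data for illustration; 99999 is deliberately outside an 18846-document corpus
toy_queries = ["NASA missions to Mars", "introduction to cryptography"]
toy_truth = {
    "NASA missions to Mars": [11, 65, 210],
    "introduction to cryptography": [61, 99999],
}
print(validate_ground_truth(toy_queries, toy_truth, n_docs=18846))
```

On the notebook's data the call would be `validate_ground_truth(test_queries, ground_truth, len(newsgroups.data))`.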


2. **Precision and recall for each technique** (2 points): Implement the evaluation with standard metrics.  


In [43]:
from sklearn.metrics import precision_score, recall_score

def evaluate_model(search_function, model, doc_vectors, queries, ground_truth, top_n=70):
    precision_scores = []
    recall_scores = []
    n_docs = doc_vectors.shape[0]

    for query in queries:
        # Documents retrieved by the search technique (as a set for fast lookup)
        retrieved_docs = set(search_function(query, model, doc_vectors, top_n, return_indices=True))

        # Relevant documents from the ground truth
        relevant_docs = set(ground_truth.get(query, []))

        # Binary vectors over the corpus (1 if the doc is relevant / retrieved, 0 otherwise)
        y_true = [1 if i in relevant_docs else 0 for i in range(n_docs)]
        y_pred = [1 if i in retrieved_docs else 0 for i in range(n_docs)]

        # Compute precision and recall for this query
        precision = precision_score(y_true, y_pred, zero_division=0)
        recall = recall_score(y_true, y_pred, zero_division=0)

        precision_scores.append(precision)
        recall_scores.append(recall)

    # Average the per-query results
    avg_precision = np.mean(precision_scores)
    avg_recall = np.mean(recall_scores)

    return avg_precision, avg_recall
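
The per-query precision and recall can also be computed directly from sets, without building corpus-sized binary vectors. The sketch below shows that equivalent formulation: precision is hits divided by the number of retrieved documents, and recall is hits divided by the number of relevant documents. The helper name `precision_recall_at_k` and the toy id lists are assumptions for illustration.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall of the first k retrieved document ids against a relevant set."""
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = [8, 42, 99, 7, 254]       # ranked ids returned by a search function
relevant = [8, 99, 254, 408, 777]     # ground-truth ids for the query

# Three of the five retrieved ids are relevant, and three of five relevant ids are found
print(precision_recall_at_k(retrieved, relevant, k=5))
# Only the first two results: one hit out of two retrieved, one of five relevant
print(precision_recall_at_k(retrieved, relevant, k=2))
```

This gives the same numbers as `precision_score` and `recall_score` on the binary vectors, provided the retrieved list has no duplicate ids.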


In [44]:
# Evaluation for TF-IDF
precision_tfidf, recall_tfidf = evaluate_model(search_tfidf, vectorizer, X_tfidf, test_queries, ground_truth)

# Evaluation for Word2Vec
precision_w2v, recall_w2v = evaluate_model(search_word2vec, w2v_model, doc_vectors, test_queries, ground_truth)

# Evaluation for Sentence Transformers
precision_sbert, recall_sbert = evaluate_model(search_transformers, sbert_model, doc_embeddings, test_queries, ground_truth)

# Show the comparative results
print("\n📊 Comparison of Search Techniques:")
print(f"🔹 TF-IDF -> Precision: {precision_tfidf:.10f}, Recall: {recall_tfidf:.10f}")
print(f"🔹 Word2Vec -> Precision: {precision_w2v:.10f}, Recall: {recall_w2v:.10f}")
print(f"🔹 Sentence Transformers -> Precision: {precision_sbert:.10f}, Recall: {recall_sbert:.10f}")



📊 Comparison of Search Techniques:
🔹 TF-IDF -> Precision: 0.0003663004, Recall: 0.0051282051
🔹 Word2Vec -> Precision: 0.0003663004, Recall: 0.0051282051
🔹 Sentence Transformers -> Precision: 0.0003663004, Recall: 0.0051282051


3. **Comparative analysis** (2 points): Compare the results of the three techniques and justify their effectiveness based on the results.


Comparison of Search Techniques for Different Values of \( n \)

When evaluating the three retrieval techniques, performance varied with the number of documents retrieved per query (\( n \), the `top_n` argument of `evaluate_model`).  

With low values of \( n \), only **TF-IDF** achieved nonzero values on the evaluation metrics, indicating that its top-ranked results were the first to overlap with the ground truth. When \( n \) was raised to **50**, **Sentence Transformers** also began to score, which suggests that this model ranks the ground-truth documents lower and needs a longer result list before a match appears. Finally, at \( n = 70 \), **Word2Vec** scored as well, so its matches sit deeper still in the ranking.  

Quantitatively, at \( n = 70 \) all three techniques reached identical **precision** and **recall** values:

- **TF-IDF** → Precision: 0.000366, Recall: 0.0051  
- **Word2Vec** → Precision: 0.000366, Recall: 0.0051  
- **Sentence Transformers** → Precision: 0.000366, Recall: 0.0051  
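
The reported figures can be reproduced with simple arithmetic. Assuming every query returns exactly 70 documents and has 5 ground-truth documents (as in the cells above, with 39 test queries), the averaged precision equals total hits divided by 39 × 70 and the averaged recall equals total hits divided by 39 × 5. The sketch below checks which total hit count matches the printed values; the variable and function names are illustrative.

```python
from fractions import Fraction

n_queries = 39            # number of entries in test_queries
top_n = 70                # documents retrieved per query
relevant_per_query = 5    # ground-truth documents per query

def macro_scores(total_hits):
    """Averaged precision and recall when total_hits relevant documents are found across all queries."""
    precision = Fraction(total_hits, n_queries * top_n)
    recall = Fraction(total_hits, n_queries * relevant_per_query)
    return float(precision), float(recall)

for hits in (1, 2):
    p, r = macro_scores(hits)
    print(f"total hits={hits}: precision={p:.10f}, recall={r:.10f}")
```

A single hit gives 0.0003663004 and 0.0051282051, so the reported values are consistent with each technique retrieving exactly one ground-truth document across all 39 queries.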

### **Conclusions and Recommendations**
1. **Influence of result-list size**: The results indicate that more sophisticated methods such as **Word2Vec** and **Sentence Transformers** need a larger number of retrieved documents to reach a performance comparable to **TF-IDF**.  
2. **Balance between precision and semantic capacity**: **TF-IDF** responds well when few documents are retrieved, while **Word2Vec** and **Sentence Transformers** only match it once the result list is longer.  
3. **Choosing a method by context**: For applications that return only a few results, **TF-IDF** remains a viable option. Where longer result lists are acceptable, models such as **Word2Vec** and **Sentence Transformers** may improve semantic retrieval.  
4. **Tuning the parameter \( n \)**: The value **\( n = 70 \)** appears to be a suitable threshold at which all techniques reach the same metric values, which suggests that increasing \( n \) further might not yield significant improvements.   
