# SEMANTIC SEARCH IN ARTICLES USING NLP

This notebook implements and compares three search approaches:

1. Lexical Search (BM25) - Traditional keyword-based retrieval
2. Semantic Search (spaCy) - Word-embedding-based search
3. Semantic Search (Transformers) - BERT-based contextual embeddings

The system also extracts hot keywords using the YAKE and KeyBERT algorithms, and it provides an approach for evaluating the different models.

## DATA PREPARATION


- BBC News Articles Dataset (2004–2005)
- The BBC News Articles Dataset is a collection of 2,225 news documents published by the BBC between 2004 and 2005.
- It covers five major categories: Business, Entertainment, Sport, Tech, Politics

- Each document contains the full text of a BBC news article, making it well-suited for information retrieval, keyword extraction, and semantic search tasks.

- It is publicly available through the BBC News Summary dataset on Kaggle [dataset link](https://www.kaggle.com/datasets/pariza/bbc-news-summary), originally compiled for text summarization and classification research.

Data Transformation

In [57]:
import os
import pandas as pd

# Define paths
base_dir = "../data/raw/NewsArticles"
output_path = "../data/processed/articles.csv"

data = []
article_id = 0

# Check base directory
if not os.path.exists(base_dir):
    raise FileNotFoundError(f"Directory not found: {base_dir}")

# Loop through categories and articles
for category in os.listdir(base_dir):
    category_path = os.path.join(base_dir, category)
    if not os.path.isdir(category_path):
        continue

    for filename in os.listdir(category_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(category_path, filename)
            try:
                with open(file_path, "r", encoding="latin1") as f:
                    text = f.read().strip()
                data.append({
                    "id": article_id,
                    "category": category,
                    "text": text
                })
                article_id += 1
            except Exception as e:
                print(f"Failed to read {file_path}: {e}")

# Create DataFrame
df = pd.DataFrame(data)

# Ensure output directory exists
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Save combined dataset
df.to_csv(output_path, index=False, encoding="utf-8")

# Display preview
print(f"Loaded {len(df)} articles across {df['category'].nunique()} categories.")
print(f"Saved to: {output_path}")
display(df.head())


Loaded 2225 articles across 5 categories.
Saved to: ../data/processed/articles.csv


Unnamed: 0,id,category,text
0,0,business,Ad sales boost Time Warner profit\n\nQuarterly...
1,1,business,Dollar gains on Greenspan speech\n\nThe dollar...
2,2,business,Yukos unit buyer faces loan claim\n\nThe owner...
3,3,business,High fuel prices hit BA's profits\n\nBritish A...
4,4,business,Pernod takeover talk lifts Domecq\n\nShares in...


Load the processed articles dataset


In [59]:
data = pd.read_csv("../data/processed/articles.csv")
df = data.copy()

Display category distribution to understand dataset composition

In [60]:
df['category'].value_counts()

category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

Text Cleaning

In [61]:
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()

df['text'] = data['text'].apply(clean_text)
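
As a quick check of what this cleaning does, the sketch below reapplies the same `clean_text` steps to two short strings. The first is the start of one of the headlines shown earlier; the second is an illustrative fragment, not taken from the dataset. Punctuation is deleted rather than replaced by a space, so possessives and hyphenated words are fused.

```python
import re

def clean_text(text):
    # Same steps as the notebook's clean_text
    text = text.lower()
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()

headline = "High fuel prices hit BA's profits\n\nBritish Airways"
fragment = "a six-year high, up 35%"  # illustrative fragment

cleaned_headline = clean_text(headline)
cleaned_fragment = clean_text(fragment)

print(cleaned_headline)  # high fuel prices hit bas profits british airways
print(cleaned_fragment)  # a sixyear high up 35
```

If fused tokens such as `sixyear` turn out to hurt retrieval, replacing punctuation with a space instead of an empty string is one possible variation.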

Verify cleaned data structure

In [62]:
df.head()

Unnamed: 0,id,category,text
0,0,business,ad sales boost time warner profit quarterly pr...
1,1,business,dollar gains on greenspan speech the dollar ha...
2,2,business,yukos unit buyer faces loan claim the owners o...
3,3,business,high fuel prices hit bas profits british airwa...
4,4,business,pernod takeover talk lifts domecq shares in uk...


## HOT KEYWORDS EXTRACTION

 Extract important keywords from articles using two different algorithms:
 - YAKE: Statistical approach based on word frequency and position
 - KeyBERT: Transformer-based approach using semantic similarity

### 1: YAKE (Yet Another Keyword Extractor)

In [63]:
import yake
def extract_hot_keywords_yake(text, top_n=10):
    """
    Extract hot keywords using YAKE algorithm.
    
    YAKE is an unsupervised keyword extraction method that uses:
    - Word frequency
    - Word position
    - Word context
    - Word case information
    
    Args:
        text (str): Input article text
        top_n (int): Number of keywords to extract (default: 10)
        
    Returns:
        list: List of (keyword, score) tuples, lower scores = more important
    """
    kw_extractor = yake.KeywordExtractor(
        lan="en",                    # Language: English
        n=3,                         # Max n-gram size (1-3 word phrases)
        dedupLim=0.9,                # Deduplication threshold (0.9 = high similarity)
        dedupFunc='seqm',            # Sequence matcher for deduplication
        windowsSize=1,               # Context window size
        top=top_n,                   # Number of keywords to return
        features=None
    )
    
    keywords = kw_extractor.extract_keywords(text)
    return keywords

Extract keywords for all articles using YAKE

In [None]:
yake_hot_keywords_df = pd.DataFrame({
    'id': df['id'],
    'hot_keywords': df['text'].apply(
        lambda text: [kw[0] for kw in extract_hot_keywords_yake(text, top_n=10)]
    )
})

In [65]:
yake_hot_keywords_df.head()

Unnamed: 0,article_id,hot_keywords
0,0,"[fourth quarter profits, warners fourth quarte..."
1,1,"[current account deficit, federal reserve head..."
2,2,"[owner menatep group, case menatep groups, men..."
3,3,"[high fuel prices, blamed high fuel, fuel cost..."
4,4,"[allied domecq shares, lifts domecq shares, al..."


Save YAKE keywords to JSON for later use

In [66]:
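# Make sure the output directory exists before writing
os.makedirs('../data/hot_keywords', exist_ok=True)
yake_hot_keywords_df.to_json(
    '../data/hot_keywords/yake_hot_keywords.json', 
    orient='records',
    indent=2,
    force_ascii=False
)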

### 2: KeyBERT (Transformer-based Keyword Extraction)

Initialize KeyBERT with pre-trained sentence transformer model

In [67]:
from keybert import KeyBERT
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

In [68]:
def extract_hot_keywords_keybert(text, top_n=10):
    """
    Extract keywords using KeyBERT (transformer-based approach).
    
    KeyBERT uses BERT embeddings to find words/phrases most similar 
    to the document, providing semantically relevant keywords.
    
    Args:
        text (str): Input article text
        top_n (int): Number of keywords to extract
        
    Returns:
        list: List of (keyword, score) tuples with similarity scores
    """
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 3),   # Extract 1-3 word phrases
        stop_words='english',            # Remove English stop words
        top_n=top_n
    )
    return keywords
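
The ranking idea described in the docstring can be sketched without any model: embed the document and each candidate phrase, then sort candidates by cosine similarity to the document. The vectors below are hand-made toy values standing in for real embeddings, and `rank_candidates` is a hypothetical helper, not part of the KeyBERT API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(doc_vec, candidate_vecs, top_n=2):
    """Return the top_n (phrase, similarity) pairs, most similar first."""
    scored = [(phrase, cosine(doc_vec, vec)) for phrase, vec in candidate_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy 3-dimensional "embeddings", for illustration only
doc_vec = [0.9, 0.1, 0.2]
candidate_vecs = {
    "fuel prices": [0.8, 0.2, 0.1],
    "british airways": [0.7, 0.1, 0.6],
    "last week": [0.1, 0.9, 0.3],
}

top_keywords = rank_candidates(doc_vec, candidate_vecs, top_n=2)
print([phrase for phrase, _ in top_keywords])  # ['fuel prices', 'british airways']
```

Unlike YAKE, where lower scores are better, a higher similarity here means a more relevant keyword.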


Extract keywords for all articles using KeyBERT

In [None]:
bert_hot_keywords_df = pd.DataFrame({
    'id': df['id'],
    'hot_keywords': df['text'].apply(
        lambda text: [kw[0] for kw in extract_hot_keywords_keybert(text, top_n=10)]
    )
})

Save KeyBERT keywords to JSON

In [70]:
bert_hot_keywords_df.to_json(
    '../data/hot_keywords/Keybert_hot_keywords.json', 
    orient='records',
    indent=2,
    force_ascii=False
)

## LEXICAL SEARCH (BM25)

- BM25 is a probabilistic ranking function widely used by search engines.
- It scores documents by term frequency and inverse document frequency, with a normalization for document length.
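
BM25 combines term frequency, inverse document frequency, and a document-length normalization. A minimal plain-Python sketch of that scoring is shown here for intuition only. The function name `bm25_scores`, the toy corpus, the parameter values `k1=1.5` and `b=0.75`, and the smoothed IDF form are all assumptions of this sketch; the `rank_bm25` library used in this notebook may compute IDF differently.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against `query_tokens`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF (always positive): rarer terms weigh more
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized via length_norm
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy, already-tokenized corpus (illustrative only)
toy_corpus = [
    ["china", "export", "rise", "trade", "surplus"],
    ["trade", "deficit", "widen", "sharply"],
    ["film", "award", "ceremony", "london"],
]
toy_scores = bm25_scores(["china", "trade", "surplus"], toy_corpus)
best_doc = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(best_doc)  # 0
```

The third document shares no term with the query, so its score is exactly zero; this is the behavior that semantic search is meant to soften.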

Download spaCy language model for text processing

In [None]:
!python -m spacy download en_core_web_sm

In [16]:
import spacy
from contractions import fix
import os
import pickle
from rank_bm25 import BM25Okapi
import numpy as np


Load spaCy model (disable unnecessary components for speed)

In [17]:
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

In [18]:
def preprocess_text(text):
    """
    Complete preprocessing pipeline for lexical search.
    
    Pipeline:
    1. Lowercase conversion
    2. Contraction expansion (e.g., "don't" -> "do not")
    3. Tokenization using spaCy
    4. Remove stop words and short tokens (< 3 chars)
    5. Lemmatization (convert words to base form)
    
    Args:
        text (str): Raw text string
        
    Returns:
        list: List of processed tokens
    """
    # Step 1: Lowercase
    text = text.lower()
    
    # Step 2: Expand contractions (don't -> do not)
    text = fix(text)
    
    # Step 3-5: Tokenization, cleaning, lemmatization
    doc = nlp(text)
    tokens = [
        token.lemma_ 
        for token in doc 
        if token.is_alpha and not token.is_stop and len(token) > 2
    ]
    return tokens

Create corpus by preprocessing all articles

In [19]:
corpus = df['text'].apply(preprocess_text).tolist()

Save preprocessed corpus for future use

In [20]:
file_path = os.path.join("../data/processed/", "corpus.pkl")
with open(file_path, "wb") as f:
    pickle.dump(corpus, f)

Initialize BM25 model with preprocessed corpus

In [21]:
bm25 = BM25Okapi(corpus)

Save BM25 model

In [22]:
os.makedirs("../models", exist_ok=True)
with open("../models/bm25_model.pkl", "wb") as f:
    pickle.dump(bm25, f)

In [23]:
def search_articles_bm25(query_sentence, df, top_n=5):
    """
    Retrieve most relevant articles using BM25 ranking.
    
    BM25 (Best Match 25) considers:
    - Term frequency: How often query terms appear in document
    - Document length: Normalize by document length
    - Inverse document frequency: Rare terms are more valuable
    
    Args:
        query_sentence (str): User search query
        df (DataFrame): Articles dataframe
        top_n (int): Number of results to return
        
    Returns:
        DataFrame: Top N articles with BM25 scores
    """
    # Clean and preprocess query
    query_sentence = clean_text(query_sentence)
    query_tokens = preprocess_text(query_sentence)
    
    # Compute BM25 scores for all documents
    scores = bm25.get_scores(query_tokens)
    
    # Get indices of top-N highest scoring documents
    top_n_idx = np.argsort(scores)[::-1][:top_n]
    
    # Build results DataFrame
    results = df.iloc[top_n_idx].copy()
    results["bm25_score"] = scores[top_n_idx]
    
    return results[["id", "category", "bm25_score", "text"]]

Test BM25 search with sample query

In [24]:
sentence_query="Analyze China's export surge of 35% that propelled its trade surplus to a six-year high, identifying the key export sectors, the role of currency valuation, the impact on global trade balances, and the responses from trading partners."
results_bm25 = search_articles_bm25(sentence_query, df, top_n=5)

In [25]:
results_bm25

Unnamed: 0,id,category,bm25_score,text
426,426,business,59.00927,chinese exports rise 25 in 2004 exports from c...
439,439,business,49.612413,us trade deficit widens sharply the gap betwee...
504,504,business,42.571624,china now top trader with japan china overtook...
23,23,business,33.683889,us trade gap hits record in 2004 the gap betwe...
428,428,business,33.15755,us trade gap ballooned in october the us trade...


## SEMANTIC SEARCH WITH SPACY EMBEDDINGS

- spaCy's large English model provides pre-trained 300-dimensional static word vectors that capture semantic similarity beyond exact keyword matches.

Download large spaCy model with word vectors

In [None]:
!python -m spacy download en_core_web_lg 

In [27]:
import spacy
import numpy as np
import os
import faiss

Load large spaCy model with word vectors

In [28]:
nlp = spacy.load("en_core_web_lg")

In [29]:
def compute_spacy_doc_vectors(df, text_col="text", save_path="../data/embeddings/spacy_doc_vectors.npy"):
    """
    Compute and cache document embeddings using spaCy word vectors.
    
    spaCy's doc.vector averages the word vectors of all tokens,
    creating a single dense representation of the document's meaning.
    
    Args:
        df (DataFrame): Articles dataframe
        text_col (str): Name of text column
        save_path (str): Path to cache embeddings
        
    Returns:
        ndarray: Matrix of document vectors (n_docs x 300)
    """
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    
    # Load from cache if exists
    if os.path.exists(save_path):
        return np.load(save_path)
    
    # Compute document vectors (averaging word embeddings)
    doc_vectors = np.vstack([nlp(text).vector for text in df[text_col]])
    
    # Cache for future use
    np.save(save_path, doc_vectors)
    
    return doc_vectors
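
Because `doc.vector` is an average of token vectors, it ignores word order. The toy sketch below uses hand-made 3-dimensional vectors (not real spaCy vectors) and a hypothetical helper `mean_vector` to make this concrete.

```python
def mean_vector(tokens, word_vectors):
    """Average the vectors of all tokens, component by component."""
    dims = len(next(iter(word_vectors.values())))
    total = [0.0] * dims
    for token in tokens:
        for i, value in enumerate(word_vectors[token]):
            total[i] += value
    return [value / len(tokens) for value in total]

# Hand-made toy vectors, for illustration only
word_vectors = {
    "china": [1.0, 0.0, 0.0],
    "overtakes": [0.0, 1.0, 0.0],
    "japan": [0.0, 0.0, 1.0],
}

vec_a = mean_vector(["china", "overtakes", "japan"], word_vectors)
vec_b = mean_vector(["japan", "overtakes", "china"], word_vectors)
print(vec_a == vec_b)  # True: both word orders map to the same vector
```

Averaging over hundreds of tokens also pulls every article toward a similar "general news" vector, which is a plausible reason this representation separates documents less sharply.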

In [30]:
def build_faiss_index(doc_vectors, save_path="../data/embeddings/faiss_index.bin"):
    """
    Build a FAISS index for nearest-neighbor search over document vectors.
    
    FAISS (Facebook AI Similarity Search) provides efficient similarity
    search in high-dimensional spaces. IndexFlatIP, used here, performs an
    exact (brute-force) inner-product search rather than an approximate one.
    
    Args:
        doc_vectors (ndarray): Document embedding matrix
        save_path (str): Path to save FAISS index
        
    Returns:
        faiss.Index: FAISS index object
    """
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    
    # Ensure float32 dtype (required by FAISS)
    doc_vectors = doc_vectors.astype("float32")
    
    # Normalize vectors for cosine similarity
    # After normalization, inner product = cosine similarity
    faiss.normalize_L2(doc_vectors)
    
    # Create FAISS index using inner product (cosine similarity)
    index = faiss.IndexFlatIP(doc_vectors.shape[1])
    index.add(doc_vectors)
    
    # Persist index to disk
    faiss.write_index(index, save_path)
    
    return index
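
The normalization step matters because the inner product of two unit-length vectors equals their cosine similarity. The sketch below is a plain-Python stand-in for that combination, not FAISS code; `brute_force_search` and the toy vectors are hypothetical and only illustrate the idea of scoring every document and keeping the best ones.

```python
import math

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(vec, vec))
    return [x / norm for x in vec]

def cosine(u, v):
    return inner_product(u, v) / (
        math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    )

def brute_force_search(query, docs, top_n=2):
    """Exact search: score every document, then keep the top_n."""
    q = l2_normalize(query)
    scored = [(inner_product(q, l2_normalize(d)), i) for i, d in enumerate(docs)]
    scored.sort(reverse=True)
    return scored[:top_n]

docs = [[3.0, 4.0], [4.0, 3.0], [-1.0, 0.5]]  # toy document vectors
query = [6.0, 8.0]                            # same direction as docs[0]

hits = brute_force_search(query, docs)
print([doc_index for _, doc_index in hits])  # [0, 1]
```

The query is twice as long as `docs[0]` but points the same way, so after normalization their score is 1 (up to floating-point rounding): vector length no longer influences the ranking.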

In [31]:
def search_articles_spacy(query, df, index, top_n=5):
    """
    Perform semantic search using FAISS + spaCy embeddings.
    
    This approach finds articles semantically similar to the query,
    even if they don't share exact keywords.
    
    Args:
        query (str): User search query
        df (DataFrame): Articles dataframe
        index (faiss.Index): Pre-built FAISS index
        top_n (int): Number of results to return
        
    Returns:
        DataFrame: Top N most similar articles with similarity scores
    """
    # Clean query and convert to vector
    query = clean_text(query)
    query_vec = nlp(query).vector.astype("float32").reshape(1, -1)
    faiss.normalize_L2(query_vec)
    
    # Perform fast similarity search
    distances, indices = index.search(query_vec, top_n)
    
    # Build results DataFrame
    results = df.iloc[indices[0]].copy()
    results["similarity"] = distances[0]
    
    return results[["id", "category", "similarity", "text"]]

Compute embeddings and build FAISS index

In [32]:
doc_vectors_spacy = compute_spacy_doc_vectors(df)
index_spacy = build_faiss_index(doc_vectors_spacy, save_path="../data/embeddings/faiss_index_spacy.bin")

Test spaCy semantic search

In [33]:
sentence_query="Analyze China's export surge of 35% that propelled its trade surplus to a six-year high, identifying the key export sectors, the role of currency valuation, the impact on global trade balances, and the responses from trading partners."
results_spacy = search_articles_spacy(sentence_query, df, index_spacy, top_n=5)


In [34]:
results_spacy

Unnamed: 0,id,category,similarity,text
66,66,business,0.920587,fao warns on impact of subsidies billions of f...
171,171,business,0.911932,newest eu members underpin growth the european...
426,426,business,0.905781,chinese exports rise 25 in 2004 exports from c...
282,282,business,0.905367,stock market eyes japan recovery japanese shar...
339,339,business,0.905277,venezuela and china sign oil deal venezuelan p...


## SEMANTIC SEARCH WITH TRANSFORMER EMBEDDINGS

- Transformer models (BERT-based) provide contextual embeddings that capture nuanced semantic meaning.

In [37]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import os

- Initialize sentence transformer model
- all-MiniLM-L6-v2: Fast and efficient model with 384-dimensional embeddings

In [38]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [39]:
def compute_transformer_doc_vectors(df, text_col="text", cache_path="../data/embeddings/transformer_doc_vectors.npy"):
    """
    Compute and cache SentenceTransformer embeddings for each document.
    
    SentenceTransformers produce high-quality sentence embeddings using
    BERT-based models fine-tuned for semantic similarity tasks.
    
    Args:
        df (DataFrame): Articles dataframe
        text_col (str): Name of text column
        cache_path (str): Path to cache embeddings
        
    Returns:
        ndarray: Matrix of document vectors (n_docs x 384)
    """
    # Load from cache if available
    if os.path.exists(cache_path):
        embeddings = np.load(cache_path)
    else:
        # Encode all documents (batch processing with progress bar)
        embeddings = embedder.encode(
            df[text_col].tolist(), 
            convert_to_numpy=True, 
            show_progress_bar=True
        )
        # Cache embeddings
        np.save(cache_path, embeddings)
    
    return embeddings

In [40]:
def search_articles_semantic(query, df, index, top_n=5):
    """
    Perform semantic search using SentenceTransformer embeddings + FAISS.
    
    Unlike averaged word vectors, these embeddings are contextual, so they
    capture meaning that depends on word order and surrounding words.
    
    Args:
        query (str): User search query
        df (DataFrame): Articles dataframe
        index (faiss.Index): Pre-built FAISS index
        top_n (int): Number of results to return
        
    Returns:
        DataFrame: Top N most similar articles with similarity scores
    """
    # Clean and encode query
    query = clean_text(query)
    query_vec = embedder.encode([query], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(query_vec)
    
    # Perform similarity search
    distances, indices = index.search(query_vec, top_n)
    
    # Build results
    results = df.iloc[indices[0]].copy()
    results["similarity"] = distances[0]
    
    return results[["id", "category", "similarity", "text"]]

Compute transformer embeddings and build FAISS index

In [41]:
doc_vectors_semantic = compute_transformer_doc_vectors(df)
index_semantic = build_faiss_index(doc_vectors_semantic, save_path="../data/embeddings/faiss_index_semantic.bin")

Test transformer-based semantic search

In [42]:
sentence_query="Analyze China's export surge of 35% that propelled its trade surplus to a six-year high, identifying the key export sectors, the role of currency valuation, the impact on global trade balances, and the responses from trading partners."
results_semantic = search_articles_semantic(sentence_query, df, index_semantic, top_n=5)

In [43]:
results_semantic

Unnamed: 0,id,category,similarity,text
426,426,business,0.649032,chinese exports rise 25 in 2004 exports from c...
428,428,business,0.647292,us trade gap ballooned in october the us trade...
15,15,business,0.630536,china keeps tight rein on credit chinas effort...
504,504,business,0.610793,china now top trader with japan china overtook...
23,23,business,0.591622,us trade gap hits record in 2004 the gap betwe...


## Evaluation

- Compare all three search approaches using standard information retrieval metrics: Precision@K, NDCG@K, and Mean Average Precision (MAP).

In [44]:
import pandas as pd
import numpy as np
from sklearn.metrics import ndcg_score


In [45]:
def load_ground_truth(csv_path):
    """
    Load ground truth relevance judgments from CSV.
    
    Ground truth format: Each row contains a query and 5 relevant doc IDs
    ordered by relevance (id_1 is most relevant).
    
    Args:
        csv_path (str): Path to evaluation CSV
        
    Returns:
        dict: Mapping of query -> list of relevant doc IDs
    """
    df_eval = pd.read_csv(csv_path)
    ground_truth = {}
    
    for _, row in df_eval.iterrows():
        # Extract relevant IDs in order of decreasing relevance
        relevant_ids = []
        for col in ["id_1", "id_2", "id_3", "id_4", "id_5"]:
            if pd.notna(row[col]):
                relevant_ids.append(int(row[col]))
        ground_truth[row["query"]] = relevant_ids
    
    return ground_truth
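
For reference, a sketch of the CSV layout this loader expects, parsed here with the standard library instead of pandas. The column names come from the function above; the queries and IDs are made up for illustration, and the second row shows a query with fewer than five judged documents.

```python
import csv
import io

# Made-up rows, for illustration only
sample_csv = """query,id_1,id_2,id_3,id_4,id_5
example query a,1,2,3,4,5
example query b,10,20,30,,
"""

ground_truth_demo = {}
for row in csv.DictReader(io.StringIO(sample_csv)):
    ids = [row[f"id_{i}"] for i in range(1, 6)]
    # Skip empty cells, mirroring the pd.notna check in load_ground_truth
    ground_truth_demo[row["query"]] = [int(doc_id) for doc_id in ids if doc_id]

print(ground_truth_demo["example query b"])  # [10, 20, 30]
```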

In [46]:
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """
    Calculate Precision@K metric.
    
    Precision@K measures the proportion of relevant documents 
    in the top-K retrieved results.
    
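    Formula: P@K = (# relevant docs in top-K) / K
    (if fewer than K documents are retrieved, the denominator is the
    number of documents actually retrieved)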
    
    Args:
        retrieved_ids (list): Retrieved document IDs (in rank order)
        relevant_ids (list): Ground truth relevant document IDs
        k (int): Cut-off rank position
        
    Returns:
        float: Precision score between 0 and 1
    """
    if not retrieved_ids:
        return 0.0
    
    retrieved_k = retrieved_ids[:k]
    relevant_set = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_k if doc_id in relevant_set)
    
    return hits / min(k, len(retrieved_k))
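
A small worked example of this metric, using hypothetical document IDs. The function body is repeated so the snippet runs on its own.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    if not retrieved_ids:
        return 0.0
    retrieved_k = retrieved_ids[:k]
    relevant_set = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_k if doc_id in relevant_set)
    return hits / min(k, len(retrieved_k))

retrieved = [1, 9, 2, 8, 7]  # hypothetical ranked result IDs
relevant = [1, 2, 3, 4, 5]   # hypothetical ground-truth IDs

p_at_5 = precision_at_k(retrieved, relevant, k=5)  # 2 hits / 5 = 0.4
p_at_3 = precision_at_k(retrieved, relevant, k=3)  # 2 hits / 3
p_short = precision_at_k([1, 9], relevant, k=5)    # 1 hit / 2 retrieved = 0.5

print(p_at_5, p_short)  # 0.4 0.5
```

The last call shows the effect of dividing by `min(k, len(retrieved_k))`: a search that returns only two documents is not penalized for the three empty slots.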

In [48]:
def ndcg_at_k(retrieved_ids, relevant_ids, k=5):
    """
    Compute Normalized Discounted Cumulative Gain (NDCG@K).
    
    NDCG@K measures how well the retrieved results are ranked,
    giving higher weight to relevant documents at top positions.
    
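    Formula: NDCG = DCG / IDCG
    - DCG: Sum of (relevance / log2(rank+1))
    - IDCG: DCG of the best possible ordering of the retrieved documents.
      Note: only retrieved documents are passed to ndcg_score, so relevant
      documents that were not retrieved do not lower the score.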
    
    Args:
        retrieved_ids (list): Retrieved document IDs (in rank order)
        relevant_ids (list): Ground truth relevant IDs (in relevance order)
        k (int): Cut-off rank position
        
    Returns:
        float: NDCG score between 0 and 1
    """
    if not retrieved_ids or not relevant_ids:
        return 0.0
    
    # Assign decreasing relevance scores (id_1 = most relevant)
    relevance_map = {
        doc_id: len(relevant_ids) - i 
        for i, doc_id in enumerate(relevant_ids)
    }
    
    # Build relevance vector for retrieved docs
    retrieved_k = retrieved_ids[:k]
    y_true = np.array([[relevance_map.get(doc_id, 0) for doc_id in retrieved_k]])
    y_score = np.array([[k - i for i in range(len(retrieved_k))]])
    
    if np.sum(y_true) == 0:
        return 0.0
    
    return float(ndcg_score(y_true, y_score, k=k))
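
One property of this implementation is worth seeing in numbers: the ideal ranking is derived only from the documents that were retrieved. The sketch below compares that behavior with a variant whose ideal ranking uses the whole ground-truth list. Both functions are hypothetical plain-Python helpers; `ndcg_retrieved_only` is meant to approximate what the `ndcg_score` call above computes (linear gain, log2 discount), not to replace it.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: linear gain, log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def graded_gains(retrieved_ids, relevant_ids, k):
    """Map retrieved IDs to graded relevance (first ground-truth ID scores highest)."""
    relevance_map = {doc_id: len(relevant_ids) - i for i, doc_id in enumerate(relevant_ids)}
    return [relevance_map.get(doc_id, 0) for doc_id in retrieved_ids[:k]], relevance_map

def ndcg_retrieved_only(retrieved_ids, relevant_ids, k=5):
    """Ideal ranking = best reordering of the retrieved documents."""
    gains, _ = graded_gains(retrieved_ids, relevant_ids, k)
    if sum(gains) == 0:
        return 0.0
    return dcg(gains) / dcg(sorted(gains, reverse=True))

def ndcg_full_ideal(retrieved_ids, relevant_ids, k=5):
    """Ideal ranking = top-k of the whole ground-truth list."""
    gains, relevance_map = graded_gains(retrieved_ids, relevant_ids, k)
    ideal = sorted(relevance_map.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

retrieved = [1, 9, 8, 7, 6]  # hypothetical: only the top hit is relevant
relevant = [1, 2, 3, 4, 5]   # hypothetical ground truth

score_retrieved_only = ndcg_retrieved_only(retrieved, relevant)
score_full_ideal = ndcg_full_ideal(retrieved, relevant)
print(score_retrieved_only)  # 1.0
```

With one relevant document at rank 1 and four misses, the retrieved-only variant still reports 1.0, while the full-ideal variant gives roughly 0.49. This may help explain why NDCG@5 values in this notebook are much higher than the corresponding Precision@5 values.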

In [49]:
def average_precision(retrieved_ids, relevant_ids):
    """
    Compute Average Precision (AP).
    
    AP is the average of precision values computed at each position
    where a relevant document is retrieved.
    
    Formula: AP = (1/|relevant|) * Σ(P@k * rel(k))
    where rel(k) = 1 if doc at position k is relevant, else 0
    
    Args:
        retrieved_ids (list): Retrieved document IDs (in rank order)
        relevant_ids (list): Ground truth relevant document IDs
        
    Returns:
        float: Average Precision score between 0 and 1
    """
    if not relevant_ids:
        return 0.0
    
    relevant_set = set(relevant_ids)
    score = 0.0
    hits = 0
    
    # For each retrieved document
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_set:
            hits += 1
            # Add precision at this position
            score += hits / (i + 1)
    
    return score / len(relevant_set)
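
A worked example with hypothetical IDs shows how the running precision is accumulated, and how MAP is then the mean of AP over queries. The function body is repeated so the snippet runs on its own.

```python
def average_precision(retrieved_ids, relevant_ids):
    if not relevant_ids:
        return 0.0
    relevant_set = set(relevant_ids)
    score = 0.0
    hits = 0
    for i, doc_id in enumerate(retrieved_ids):
        if doc_id in relevant_set:
            hits += 1
            score += hits / (i + 1)
    return score / len(relevant_set)

relevant = [1, 2, 3, 4, 5]  # hypothetical ground-truth IDs

# Query A: hits at ranks 1 and 3 -> (1/1 + 2/3) / 5
ap_a = average_precision([1, 9, 2, 8, 7], relevant)
# Query B: the same two hits, but at ranks 4 and 5 -> (1/4 + 2/5) / 5
ap_b = average_precision([9, 8, 7, 1, 2], relevant)

mean_ap = (ap_a + ap_b) / 2
print(ap_a > ap_b)  # True: earlier hits are rewarded
```

Both queries have the same Precision@5 (two hits out of five), so the gap between `ap_a` and `ap_b` comes entirely from where the hits appear in the ranking.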

In [52]:
def evaluate_model(search_function, ground_truth, df, model_name="Model", k=5, **kwargs):
    """
    Evaluate a search model using all metrics.
    
    Computes Precision@K, NDCG@K, and MAP for all queries in ground truth
    and returns averaged results.
    
    Args:
        search_function: Function that takes (query, df, top_n, **kwargs)
        ground_truth: Dict mapping queries to relevant doc IDs
        df: Articles DataFrame
        model_name: Name for display in results
        k: Number of results to evaluate
        **kwargs: Additional args for search_function (e.g., index)
        
    Returns:
        dict: Evaluation metrics averaged across all queries
    """
    all_retrieved = {}
    precisions = []
    ndcgs = []
    aps = []
    
    # Evaluate each query
    for query, relevant_ids in ground_truth.items():
        # Retrieve results using specified search function
        results = search_function(query, df, top_n=k, **kwargs)
        retrieved_ids = results["id"].tolist()
        all_retrieved[query] = retrieved_ids
        
        # Calculate metrics for this query
        prec = precision_at_k(retrieved_ids, relevant_ids, k=k)
        ndcg = ndcg_at_k(retrieved_ids, relevant_ids, k=k)
        ap = average_precision(retrieved_ids, relevant_ids)
        
        precisions.append(prec)
        ndcgs.append(ndcg)
        aps.append(ap)
    
    # Return averaged metrics
    results = {
        "Model": model_name,
        f"Precision@{k}": np.mean(precisions),
        f"NDCG@{k}": np.mean(ndcgs),
        "MAP": np.mean(aps)
    }
    
    return results

In [53]:
def run_complete_evaluation(ground_truth, df, bm25, index_spacy, index_semantic, k=5):
    """
    Run evaluation for all three models and compare results.
    
    This function evaluates:
    1. Lexical Search (BM25)
    2. Semantic Search (spaCy embeddings)
    3. Semantic Search (Transformer embeddings)
    
    Args:
        ground_truth: Dict of query -> relevant doc IDs
        df: Articles DataFrame
        bm25: BM25 model object (currently unused here; search_articles_bm25
            reads the global bm25 variable)
        index_spacy: FAISS index for spaCy embeddings
        index_semantic: FAISS index for transformer embeddings
        k: Number of results to evaluate
        
    Returns:
        DataFrame: Comparison of all models' performance
    """
    results = []
    
    # 1. Evaluate BM25 (Lexical Search)
    bm25_results = evaluate_model(
        search_articles_bm25, ground_truth, df, model_name="Lexical model", k=k
    )
    results.append(bm25_results)
    
    # 2. Evaluate spaCy (Semantic Search with averaged word vectors)
    spacy_results = evaluate_model(
        search_articles_spacy, ground_truth, df,
        model_name="spaCy model", k=k, index=index_spacy
    )
    results.append(spacy_results)
    
    # 3. Evaluate Transformer (Semantic Search with sentence embeddings)
    transformer_results = evaluate_model(
        search_articles_semantic, ground_truth, df,
        model_name="semantic model", k=k, index=index_semantic
    )
    results.append(transformer_results)
    
    # Create comparison DataFrame
    comparison_df = pd.DataFrame(results)
    
    return comparison_df

RUN COMPLETE EVALUATION

Load ground truth relevance judgments

In [54]:
ground_truth = load_ground_truth("../data/evaluation/evaluation.csv")

Evaluate all three models

In [55]:
evaluation_results = run_complete_evaluation(
    ground_truth=ground_truth, df=df, bm25=bm25,
    index_spacy=index_spacy, index_semantic=index_semantic, k=5
)

Display comparison results

In [56]:
evaluation_results

Unnamed: 0,Model,Precision@5,NDCG@5,MAP
0,Lexical model,0.228571,0.888587,0.202597
1,spaCy model,0.187013,0.612753,0.141515
2,semantic model,0.244156,0.932487,0.217532


RESULTS INTERPRETATION

- Precision@5: Proportion of relevant docs in top 5 results
- NDCG@5: Quality of ranking (1.0 = perfect ranking)
- MAP: Mean Average Precision across all queries

- Higher values indicate better performance.
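- On this evaluation set, the transformer-based semantic model scores highest on all three metrics (Precision@5 0.244, NDCG@5 0.932, MAP 0.218), narrowly ahead of lexical search (0.229, 0.889, 0.203).
- Lexical search outperforms the spaCy model (0.187, 0.613, 0.142) on every metric.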