# Hard Negatives Mining

This notebook implements hard negatives mining as specified in the research paper:

**"Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers"**

## Hard Negatives Mining Strategy

The paper uses a three-step approach:

1. **BM25 Retrieval**: For each query, retrieve the top-M (typically M=1000) candidate documents
2. **Consistency Filtering**: Keep only training samples whose positive document appears in the top-10 BM25 results
3. **Sampling**: From the remaining top-M candidates (excluding the positive), sample N (typically N=7) hard negatives per query
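
The three steps can be sketched on a precomputed ranking, independent of any retrieval library. This is a minimal illustration, not the notebook's implementation: `mine_from_ranking`, its parameters, and the toy ranking are all hypothetical, and the `seed` argument is an added convenience for reproducible sampling.

```python
import random
from typing import List, Optional


def mine_from_ranking(
    ranked_ids: List[int],
    positive_id: int,
    top_m: int = 1000,
    top_k: int = 10,
    num_negatives: int = 7,
    seed: Optional[int] = None,
) -> Optional[List[int]]:
    """Return sampled negative ids, or None if the sample fails the filter."""
    # Step 1: keep the top-M candidates of the ranking (best first).
    candidates = ranked_ids[:top_m]
    # Step 2: consistency filter, the positive must rank within the top-K.
    if positive_id not in candidates[:top_k]:
        return None
    # Step 3: sample N negatives from the candidates, excluding the positive.
    pool = [doc_id for doc_id in candidates if doc_id != positive_id]
    rng = random.Random(seed)
    return rng.sample(pool, min(num_negatives, len(pool)))


# Toy ranking of document ids, best match first (illustrative values only).
ranking = [4, 9, 2, 7, 1, 0, 3]
negatives = mine_from_ranking(ranking, positive_id=9, top_m=5, top_k=2, num_negatives=3, seed=0)
rejected = mine_from_ranking(ranking, positive_id=1, top_m=5, top_k=2, num_negatives=3, seed=0)
```

Here `negatives` holds three ids drawn from the top-5 candidates other than the positive, while `rejected` is `None` because document 1 does not rank within the top-2.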

## Benefits

- Improves the model's ability to distinguish between relevant and near-relevant documents
- Focuses training on challenging examples
- Increases training effectiveness without increasing dataset size

## Implementation

We'll use:
- **BM25** from the `rank-bm25` library for initial retrieval
- **Consistency filtering** to ensure the positive document is retrievable
- **Random sampling** for hard negative selection

In [1]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [2]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('../..')

from pathlib import Path
import json
import random
from typing import Dict, List, Tuple
from tqdm import tqdm
import numpy as np

## 1. Setup and Configuration

In [3]:
# Input directories
paired_data_dir = Path("../../dataset/paired_data")
pretraining_data_dir = Path("../../dataset/pretraining")

# Output directory
output_dir = Path("../../dataset/hard_negatives")
output_dir.mkdir(parents=True, exist_ok=True)

# Mining parameters (from paper)
TOP_M = 1000  # Number of candidates to retrieve
TOP_K_FILTER = 10  # Consistency filter: positive must be in top-K
NUM_NEGATIVES = 7  # Number of hard negatives to sample
CHUNK_SIZE = 100000  # Pairs per output file
SKIP_IF_EXISTS = True  # Not referenced by the cells in this notebook

print("Hard Negatives Mining Configuration:")
print(f"  Top-M candidates: {TOP_M}")
print(f"  Consistency filter (top-K): {TOP_K_FILTER}")
print(f"  Hard negatives per query: {NUM_NEGATIVES}")
print(f"\n✓ Output directory: {output_dir}")

Hard Negatives Mining Configuration:
  Top-M candidates: 1000
  Consistency filter (top-K): 10
  Hard negatives per query: 7

✓ Output directory: ../../dataset/hard_negatives


## 2. Install BM25 Library

In [4]:
# Install rank-bm25 if not already installed
try:
    from rank_bm25 import BM25Okapi
    print("✓ rank_bm25 already installed")
except ImportError:
    print("Installing rank_bm25...")
    %pip install rank-bm25
    from rank_bm25 import BM25Okapi
    print("✓ rank_bm25 installed")

✓ rank_bm25 already installed


## 3. Define Hard Negatives Miner

In [5]:
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

# Download NLTK data if needed
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

class HardNegativesMiner:
    """Mine hard negatives using BM25 and consistency filtering."""
    
    def __init__(
        self,
        top_m: int = 1000,
        top_k_filter: int = 10,
        num_negatives: int = 7,
    ):
        """
        Initialize hard negatives miner.
        
        Args:
            top_m: Number of top candidates to retrieve with BM25
            top_k_filter: Consistency filter threshold (positive must be in top-K)
            num_negatives: Number of hard negatives to sample per query
        """
        self.top_m = top_m
        self.top_k_filter = top_k_filter
        self.num_negatives = num_negatives
    
    def tokenize(self, text: str) -> List[str]:
        """Tokenize text for BM25."""
        try:
            return word_tokenize(text.lower())
        except LookupError:
            # NLTK tokenizer data is missing: fall back to a whitespace split
            return text.lower().split()
    
    def build_bm25_index(self, documents: List[str]) -> BM25Okapi:
        """Build BM25 index from documents."""
        print(f"Tokenizing {len(documents):,} documents...")
        tokenized_docs = [self.tokenize(doc) for doc in tqdm(documents, desc="Tokenizing")]
        
        print("Building BM25 index...")
        bm25 = BM25Okapi(tokenized_docs)
        print("✓ BM25 index built")
        
        return bm25
    
    def mine_hard_negatives(
        self,
        query: str,
        positive_doc: str,
        all_documents: List[str],
        bm25_index: BM25Okapi,
    ) -> Tuple[bool, List[int]]:
        """
        Mine hard negatives for a single query.
        
        Args:
            query: Query text
            positive_doc: Positive (relevant) document
            all_documents: List of all candidate documents
            bm25_index: Pre-built BM25 index
        
        Returns:
            (passes_filter, hard_negative_indices)
            passes_filter: True if positive doc is in top-K
            hard_negative_indices: Indices of sampled hard negatives
        """
        # Tokenize query
        tokenized_query = self.tokenize(query)
        
        # Get BM25 scores for all documents
        scores = bm25_index.get_scores(tokenized_query)
        
        # Get top-M document indices
        top_m_indices = np.argsort(scores)[::-1][:self.top_m]
        
        # Find positive document index
        try:
            positive_idx = all_documents.index(positive_doc)
        except ValueError:
            # Positive doc not in corpus (shouldn't happen, but handle gracefully)
            return False, []
        
        # Consistency filtering: check if positive is in top-K
        top_k_indices = top_m_indices[:self.top_k_filter]
        passes_filter = positive_idx in top_k_indices
        
        if not passes_filter:
            return False, []
        
        # Sample hard negatives from top-M (excluding positive)
        candidate_negatives = [idx for idx in top_m_indices if idx != positive_idx]
        
        # Sample N hard negatives
        num_to_sample = min(self.num_negatives, len(candidate_negatives))
        hard_negative_indices = random.sample(candidate_negatives, num_to_sample)
        
        return True, hard_negative_indices

print("✓ HardNegativesMiner class defined")

[nltk_data] Downloading package punkt to /home/west/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

✓ HardNegativesMiner class defined


## 4. Load Paired Data and Build Document Corpus

In [6]:
import glob
from typing import Optional

def load_paired_data(data_dir: Path, pattern: str, max_samples: Optional[int] = None) -> List[Dict]:
    """Load paired data from JSONL files."""
    files = sorted(glob.glob(str(data_dir / pattern)))
    
    if not files:
        print(f"⚠ No files found matching pattern: {pattern}")
        return []
    
    print(f"Loading from {len(files)} files matching {pattern}...")
    
    data = []
    for file_path in tqdm(files, desc="Loading files"):
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                data.append(json.loads(line))
                if max_samples and len(data) >= max_samples:
                    break
        
        if max_samples and len(data) >= max_samples:
            break
    
    print(f"✓ Loaded {len(data):,} paired samples")
    return data

# For demonstration, load a subset of Korean Wikipedia data
# In production, you'd process all datasets
print("=" * 80)
print("Loading Korean Wikipedia paired data (sample)")
print("=" * 80)

# Load sample data (first 10K for demo)
ko_wiki_data = load_paired_data(
    paired_data_dir,
    "ko_wiki_title_summary_*.jsonl",
    max_samples=10000  # Limit for demo; remove in production
)

print(f"\n✓ Sample size: {len(ko_wiki_data):,} pairs")

Loading Korean Wikipedia paired data (sample)
Loading from 12 files matching ko_wiki_title_summary_*.jsonl...


✓ Loaded 10,000 paired samples

✓ Sample size: 10,000 pairs





## 5. Build BM25 Index

Build a BM25 index over all documents in the corpus.
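
For intuition about what the index scores, here is a simplified Okapi BM25 scorer in plain Python. It is a sketch under stated assumptions, not `rank-bm25`'s implementation: it uses an always-positive IDF variant, `k1=1.5` and `b=0.75` are assumed defaults, and `bm25_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    docs: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score every tokenized document against a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Always-positive IDF variant (an assumption of this sketch).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["jimmy", "carter", "president"],
    ["high", "school", "seoul"],
    ["carter", "center"],
]
scores = bm25_scores(["jimmy", "carter"], docs)
```

The first document matches both query terms and scores highest, the third matches one term, and the second shares no term with the query and scores 0.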

In [7]:
if len(ko_wiki_data) > 0:
    print("=" * 80)
    print("Building BM25 index")
    print("=" * 80)
    
    # Extract all documents
    all_documents = [item["document"] for item in ko_wiki_data]
    
    print(f"Documents in corpus: {len(all_documents):,}")
    
    # Initialize miner
    miner = HardNegativesMiner(
        top_m=TOP_M,
        top_k_filter=TOP_K_FILTER,
        num_negatives=NUM_NEGATIVES,
    )
    
    # Build BM25 index
    bm25_index = miner.build_bm25_index(all_documents)
    
    print("\n✓ BM25 index ready")
else:
    print("⚠ No data loaded, skipping BM25 index building")

Building BM25 index
Documents in corpus: 10,000
Tokenizing 10,000 documents...


Tokenizing: 100%|██████████| 10000/10000 [00:01<00:00, 8943.00it/s]


Building BM25 index...
✓ BM25 index built

✓ BM25 index ready


## 6. Mine Hard Negatives

For each query-document pair, mine hard negatives using BM25 and consistency filtering.
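
Note that `mine_hard_negatives` locates the positive with `all_documents.index(...)`, a linear scan that is repeated for every query. One possible alternative, sketched here on the assumption that matching the first occurrence of a document text is acceptable, is to precompute a lookup table once. `build_doc_index` is a hypothetical helper and is not part of the notebook's code.

```python
from typing import Dict, List


def build_doc_index(documents: List[str]) -> Dict[str, int]:
    """Map each document text to the position of its first occurrence."""
    doc_to_idx: Dict[str, int] = {}
    for idx, doc in enumerate(documents):
        # setdefault keeps the first occurrence, matching list.index semantics.
        doc_to_idx.setdefault(doc, idx)
    return doc_to_idx


docs = ["alpha", "beta", "alpha", "gamma"]
doc_to_idx = build_doc_index(docs)
positive_idx = doc_to_idx.get("gamma")   # constant-time lookup
missing_idx = doc_to_idx.get("delta")    # None when the text is not in the corpus
```

The table is built once per corpus, so each later lookup no longer depends on corpus size.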

In [8]:
if len(ko_wiki_data) > 0 and 'bm25_index' in locals():
    print("=" * 80)
    print("Mining hard negatives")
    print("=" * 80)
    
    results = []
    passed_filter_count = 0
    failed_filter_count = 0
    
    for item in tqdm(ko_wiki_data, desc="Mining hard negatives"):
        query = item["query"]
        positive_doc = item["document"]
        
        # Mine hard negatives
        passes_filter, hard_neg_indices = miner.mine_hard_negatives(
            query=query,
            positive_doc=positive_doc,
            all_documents=all_documents,
            bm25_index=bm25_index,
        )
        
        if passes_filter:
            passed_filter_count += 1
            
            # Get hard negative documents
            hard_negatives = [all_documents[idx] for idx in hard_neg_indices]
            
            result = {
                "query": query,
                "positive_doc": positive_doc,
                "hard_negatives": hard_negatives,
                "query_type": item.get("query_type", ""),
                "doc_type": item.get("doc_type", ""),
                "source": item.get("source", ""),
                "language": item.get("language", ""),
            }
            
            results.append(result)
        else:
            failed_filter_count += 1
    
    print(f"\n✓ Mining complete")
    print(f"  Passed consistency filter: {passed_filter_count:,} ({passed_filter_count/len(ko_wiki_data)*100:.1f}%)")
    print(f"  Failed consistency filter: {failed_filter_count:,} ({failed_filter_count/len(ko_wiki_data)*100:.1f}%)")
    print(f"  Total results: {len(results):,}")
else:
    print("⚠ Skipping hard negatives mining (no data or index)")
    results = []

Mining hard negatives


Mining hard negatives: 100%|██████████| 10000/10000 [00:17<00:00, 574.94it/s]


✓ Mining complete
  Passed consistency filter: 3,253 (32.5%)
  Failed consistency filter: 6,747 (67.5%)
  Total results: 3,253





## 7. Save Results

In [9]:
if len(results) > 0:
    print("=" * 80)
    print("Saving hard negatives")
    print("=" * 80)
    
    # Save in chunks
    chunk_num = 0
    
    for i in range(0, len(results), CHUNK_SIZE):
        chunk = results[i:i + CHUNK_SIZE]
        chunk_num += 1
        
        output_file = output_dir / f"ko_wiki_hard_negatives_chunk_{chunk_num:03d}.jsonl"
        
        with open(output_file, 'w', encoding='utf-8') as f:
            for item in chunk:
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
        
        print(f"Saved chunk {chunk_num}: {len(chunk):,} items to {output_file.name}")
    
    print(f"\n✓ Saved {len(results):,} hard negatives in {chunk_num} chunk(s)")
else:
    print("⚠ No results to save")

Saving hard negatives
Saved chunk 1: 3,253 items to ko_wiki_hard_negatives_chunk_001.jsonl

✓ Saved 3,253 hard negatives in 1 chunk(s)


## 8. Inspect Results

In [10]:
if len(results) > 0:
    print("=" * 80)
    print("SAMPLE HARD NEGATIVES")
    print("=" * 80)
    
    sample = results[0]
    
    print(f"\nQuery: {sample['query']}")
    print(f"\nPositive document:")
    print(f"  {sample['positive_doc'][:200]}...")
    
    print(f"\nHard negatives ({len(sample['hard_negatives'])}):")
    for i, neg in enumerate(sample['hard_negatives'][:3], 1):
        print(f"\n  {i}. {neg[:150]}...")
    
    if len(sample['hard_negatives']) > 3:
        print(f"\n  ... and {len(sample['hard_negatives']) - 3} more")
    
    print("\n" + "=" * 80)

SAMPLE HARD NEGATIVES

Query: 지미 카터

Positive document:
  제임스 얼 “지미” 카터 주니어(, 1924년 10월 1일~2024년 12월 29일)는 미국의 제39대 대통령 (1977-81)을 지낸 미국의 정치인이다. 민주당 소속으로 1963년부터 1967년까지 조지아주 상원 의원, 1971년부터 1975년까지 조지아주의 76대 주지사을 지냈다. 카터는 100세까지 산 최초의 대통령으로 미국 역사상 가장 장수한 대통령...

Hard negatives (7):

  1. 동명여자고등학교(東明女子高等學校)는 대한민국 서울특별시 은평구 대조동에 위치한 사립 고등학교이다. 학교 연혁 1921년 6월 3일 : 현 서대문구 천연동에 향상여자기예학교 개교 1936년 3월 7일 : 향상여자실업학교로 명칭 변경 1945년 11월 26일 : 재단법인 ...

  2. 개포고등학교(開浦高等學校)는 대한민국 서울특별시 강남구 개포동에 있는 공립 고등학교이다. 학교 연혁 1987년 1월 12일 : 개포고등학교 설립 인가 1987년 3월 1일 : 박노학 초대 교장 취임 1987년 3월 4일 : 제1회 입학 856명(남 428명, 여 428...

  3. 쌍떡잎식물(雙―植物, Magnoliopsida, )은 속씨식물 중 떡잎이 두 장 나는 것을 말하며, 쌍떡잎식물강으로 분류된다. 쌍자엽식물(雙子葉植物)로도 부른다. 쌍떡잎식물은 약 199,350 여 종이 존재한다....

  ... and 4 more



## 9. Statistics

In [11]:
if len(results) > 0:
    print("=" * 80)
    print("HARD NEGATIVES STATISTICS")
    print("=" * 80)
    
    # Calculate statistics
    num_hard_negs = [len(item['hard_negatives']) for item in results]
    
    print(f"\nTotal training samples: {len(results):,}")
    print(f"Hard negatives per sample:")
    print(f"  Mean: {np.mean(num_hard_negs):.2f}")
    print(f"  Min: {np.min(num_hard_negs)}")
    print(f"  Max: {np.max(num_hard_negs)}")
    
    # Total training examples (1 positive + N negatives per query)
    total_examples = sum(1 + len(item['hard_negatives']) for item in results)
    print(f"\nTotal training examples (pos + negs): {total_examples:,}")
    
    print("\n" + "=" * 80)

HARD NEGATIVES STATISTICS

Total training samples: 3,253
Hard negatives per sample:
  Mean: 7.00
  Min: 7
  Max: 7

Total training examples (pos + negs): 26,024



## Summary

This notebook implements the hard negatives mining strategy from "Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers".

**Mining Strategy:**
1. ✓ BM25 retrieval of top-M candidates (M=1000)
2. ✓ Consistency filtering (positive must be in top-10)
3. ✓ Sample N=7 hard negatives per query

**Output Format:**
```json
{
  "query": "...",
  "positive_doc": "...",
  "hard_negatives": ["...", "...", ...],
  "query_type": "title",
  "doc_type": "summary",
  "source": "ko_wiki",
  "language": "ko"
}
```
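
To illustrate how records in this format might be consumed, the following sketch expands each JSONL record into (query, positive, negative) triples. The name `iter_triples` and the triple layout are assumptions for illustration; the actual training code may expect a different shape.

```python
import io
import json
from typing import Iterator, TextIO, Tuple


def iter_triples(lines: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield one (query, positive, negative) triple per hard negative."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for negative in record["hard_negatives"]:
            yield record["query"], record["positive_doc"], negative


# In-memory stand-in for one line of a chunk file (illustrative values only).
record = {"query": "q", "positive_doc": "p", "hard_negatives": ["n1", "n2"]}
sample = io.StringIO(json.dumps(record, ensure_ascii=False) + "\n")
triples = list(iter_triples(sample))
```

With a real chunk file, the same function can be applied to an open file handle instead of the `io.StringIO` stand-in.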

**Next Steps:**
1. Process all datasets (Wikipedia, NamuWiki, pre-training datasets)
2. Prepare MS MARCO fine-tuning data (notebook 05)
3. Train model with:
   - Positive-negative pairs
   - Hard negatives for improved discrimination
   - IDF-aware penalty

**Note**: This demo processes a small sample (10K pairs). In production:
- Process all paired data sources
- Use a larger BM25 corpus
- Consider distributed processing for large datasets
- Adjust `TOP_M` based on corpus size
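
On a small corpus, top-M can reach well past the documents that share any term with the query: with 10K documents and M=1000, the candidate pool covers a tenth of the corpus, so sampled negatives may be unrelated to the query. One possible adjustment, sketched below on the assumption that a BM25 score of 0 means no query term matched, is to keep only positively scored candidates. `top_m_positive` is a hypothetical helper and is not part of the paper's procedure or of `rank-bm25`.

```python
from typing import List, Sequence


def top_m_positive(scores: Sequence[float], top_m: int) -> List[int]:
    """Indices of the top_m highest-scoring documents whose score is > 0."""
    # Rank all document indices by score, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Drop candidates with no lexical overlap (score of 0 or below).
    return [i for i in ranked[:top_m] if scores[i] > 0]


# Toy scores for six documents (illustrative values only).
scores = [0.0, 2.5, 0.0, 1.2, 0.0, 3.1]
kept = top_m_positive(scores, top_m=4)
```

In this toy case only three of the four top-ranked documents have a positive score, so the candidate pool shrinks to those three. A stricter pool yields harder negatives but can leave fewer than N candidates for some queries.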