# Algorithm Variants in rapid_textrank

This notebook explores the three TextRank algorithm variants available in `rapid_textrank`:

| Variant | Best For | Key Feature |
|---------|----------|-------------|
| **BaseTextRank** | General text | Standard TextRank implementation |
| **PositionRank** | Academic papers, news | Favors words appearing early |
| **BiasedTextRank** | Topic-focused extraction | Biases toward specified focus terms |

In [None]:
# Install if needed
%pip install -q rapid_textrank

In [None]:
from rapid_textrank import BaseTextRank, PositionRank, BiasedTextRank

## 1. BaseTextRank

The standard TextRank algorithm, based on [Mihalcea & Tarau (2004)](https://aclanthology.org/W04-3252/).

**How it works:**
1. Builds a co-occurrence graph from content words
2. Runs PageRank to score word importance
3. Extracts phrases by grouping high-scoring words

**Best for:** General-purpose keyword extraction where word position doesn't matter.
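
The three steps above can be sketched in plain Python. This is an illustrative toy, not the `rapid_textrank` implementation: the helper names (`cooccurrence_graph`, `pagerank`, `group_phrases`), the window size of 2, and the unweighted graph are all assumptions made for the example.

```python
from collections import defaultdict


def cooccurrence_graph(words, window_size=2):
    """Step 1: link each word to the words that follow it within the window."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window_size]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


def pagerank(graph, damping=0.85, iterations=50):
    """Step 2: power iteration on an undirected, unweighted graph."""
    n = len(graph)
    scores = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1 - damping) / n
            + damping * sum(scores[nb] / len(graph[nb]) for nb in neighbors)
            for node, neighbors in graph.items()
        }
    return scores


def group_phrases(words, keywords):
    """Step 3: merge runs of adjacent keywords into candidate phrases."""
    phrases, run = [], []
    for word in words + [None]:  # the trailing None flushes the last run
        if word in keywords:
            run.append(word)
        elif run:
            phrases.append(" ".join(run))
            run = []
    return phrases


# Toy input: content words only, stopwords already removed (an assumption).
words = "quantum error correction reduces quantum error rates".split()
scores = pagerank(cooccurrence_graph(words))
ranked = sorted(scores, key=scores.get, reverse=True)
phrases = group_phrases(words, set(ranked[:2]))
```

In this toy input, "quantum" and "error" have the most co-occurrence neighbors, so they rank highest and are merged into the phrase "quantum error" wherever they appear side by side.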

In [None]:
text = """
Natural language processing (NLP) is a field of artificial intelligence
that focuses on the interaction between computers and humans through
natural language. The ultimate goal of NLP is to enable computers to
understand, interpret, and generate human language in a valuable way.

Machine learning approaches have transformed NLP in recent years.
Deep learning models, particularly transformers, have achieved
state-of-the-art results on many NLP tasks including translation,
summarization, and question answering.
"""

base = BaseTextRank(top_n=10, language="en")
result = base.extract_keywords(text)

print("BaseTextRank Results:")
print("=" * 50)
for p in result.phrases:
    print(f"{p.rank:>2}. {p.text:<35} {p.score:.4f}")

## 2. PositionRank

Based on [Florescu & Caragea (2017)](https://aclanthology.org/P17-1102/), PositionRank weights words by their position in the document.

**Key insight:** In many documents (papers, news articles, reports), important terms appear early: in titles, abstracts, or introductory paragraphs.

**How it differs from BaseTextRank:**
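- Words that appear early receive a larger position weight
- The weight decays with distance from the start; in the paper, each occurrence contributes the inverse of its position
- PageRank uses these weights to bias its random restarts, so early words keep an advantage after convergence rather than only at initialization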

**Best for:** Academic papers, news articles, structured documents with front-loaded information.
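
As a rough sketch of the position weighting described in the paper, each occurrence of a word contributes the inverse of its 1-based position, and the totals are normalized to sum to 1. The function name `position_weights` is hypothetical, and whether `rapid_textrank` computes its weights exactly this way is not confirmed here.

```python
from collections import defaultdict


def position_weights(words):
    """Sum the inverse positions of each word, then normalize to sum to 1."""
    raw = defaultdict(float)
    for position, word in enumerate(words, start=1):
        raw[word] += 1.0 / position
    total = sum(raw.values())
    return {word: weight / total for word, weight in raw.items()}


# Toy input: content words in document order.
words = "quantum error correction reduces quantum error rates".split()
weights = position_weights(words)

for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:<12} {weight:.3f}")
```

"quantum" appears at positions 1 and 5, so its raw weight is 1 + 1/5, the largest in this example. "rates" appears only at position 7 and gets the smallest weight.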

In [None]:
# Academic abstract where key terms appear in the title/first sentence
abstract = """
Quantum Error Correction in Near-Term Quantum Computers

We present a novel approach to quantum error correction that significantly
reduces the overhead required for fault-tolerant quantum computation.
Our method leverages machine learning to predict and correct errors
in real-time. Experimental results on superconducting qubits demonstrate
a 50% reduction in logical error rates. These advances bring us closer
to practical quantum computing applications.
"""

pos = PositionRank(top_n=10, language="en")
result = pos.extract_keywords(abstract)

print("PositionRank Results:")
print("=" * 50)
for p in result.phrases:
    print(f"{p.rank:>2}. {p.text:<35} {p.score:.4f}")

### BaseTextRank vs PositionRank: Side-by-Side

Let's compare both algorithms on the same text to see how position weighting affects results:

In [None]:
# Same abstract, both algorithms
base = BaseTextRank(top_n=5, language="en")
pos = PositionRank(top_n=5, language="en")

base_result = base.extract_keywords(abstract)
pos_result = pos.extract_keywords(abstract)

print(f"{'BaseTextRank':<35} {'PositionRank':<35}")
print("=" * 70)

# zip stops at the shorter list, so this is safe if fewer than 5 phrases come back
for base_phrase, pos_phrase in zip(base_result.phrases, pos_result.phrases):
    print(f"{base_phrase.text:<35} {pos_phrase.text:<35}")

In [None]:
# Let's see which phrases are unique to each algorithm
base_texts = {p.text for p in base_result.phrases}
pos_texts = {p.text for p in pos_result.phrases}

only_base = base_texts - pos_texts
only_pos = pos_texts - base_texts
both = base_texts & pos_texts

print(f"In both: {both}")
print(f"Only in BaseTextRank: {only_base}")
print(f"Only in PositionRank: {only_pos}")

## 3. BiasedTextRank

Based on [Kazemi et al. (2020)](https://aclanthology.org/2020.coling-main.144/), BiasedTextRank steers extraction toward specified focus terms.

**Key parameters:**
- `focus_terms`: List of terms to bias toward
- `bias_weight`: How strongly to favor focus terms (higher = stronger bias)

**How it works:**
- Focus terms get extra weight in PageRank's random-restart step, so the walk keeps returning to them
- Words connected to focus terms inherit some of this bias
- The result emphasizes the topic you care about

**Best for:** Topic-specific extraction, document filtering, aspect-based analysis.
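
One way to picture the bias is as a personalized PageRank, where random restarts land on focus terms more often than on other words. The sketch below is a simplification under stated assumptions: the weighting scheme (`bias_weight` for focus terms, 1.0 otherwise), the helper names, and the toy graph are all hypothetical, and this is not necessarily how `rapid_textrank` applies `bias_weight` internally.

```python
def focus_restart(nodes, focus_terms, bias_weight):
    """Restart weights: `bias_weight` for focus terms, 1.0 for everything else."""
    focus = set(focus_terms)
    return {node: bias_weight if node in focus else 1.0 for node in nodes}


def personalized_pagerank(graph, restart, damping=0.85, iterations=100):
    """PageRank whose random restarts follow `restart` instead of a uniform choice."""
    total = sum(restart[node] for node in graph)
    restart = {node: restart[node] / total for node in graph}
    scores = dict(restart)
    for _ in range(iterations):
        scores = {
            node: (1 - damping) * restart[node]
            + damping * sum(scores[nb] / len(graph[nb]) for nb in neighbors)
            for node, neighbors in graph.items()
        }
    return scores


# Toy graph: vulnerabilities - security - user - performance - caching
graph = {
    "vulnerabilities": {"security"},
    "security": {"vulnerabilities", "user"},
    "user": {"security", "performance"},
    "performance": {"user", "caching"},
    "caching": {"performance"},
}

unbiased = personalized_pagerank(graph, focus_restart(graph, [], 1.0))
biased = personalized_pagerank(graph, focus_restart(graph, ["security"], 5.0))

for node in graph:
    print(f"{node:<16} unbiased={unbiased[node]:.3f}  biased={biased[node]:.3f}")
```

The toy graph is symmetric, so "security" and "performance" score the same without bias. With the bias, "security" moves ahead of "performance", and its neighbor "vulnerabilities" moves ahead of "caching", which is the inheritance effect described above.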

In [None]:
# Text covering multiple topics
tech_article = """
Modern web applications must balance user experience with security.
Performance optimizations are crucial for mobile users on slow networks.
Privacy regulations like GDPR require careful data handling and consent.
Security vulnerabilities can expose sensitive user information.
Caching strategies improve response times but complicate data freshness.
Authentication systems must prevent unauthorized access while remaining
user-friendly. Encryption protects data both in transit and at rest.
"""

# Focus on security/privacy topics
biased = BiasedTextRank(
    focus_terms=["security", "privacy"],
    bias_weight=5.0,
    top_n=10,
    language="en"
)

result = biased.extract_keywords(tech_article)

print("BiasedTextRank (focus: security, privacy):")
print("=" * 50)
for p in result.phrases:
    print(f"{p.rank:>2}. {p.text:<35} {p.score:.4f}")

### Effect of `bias_weight`

The `bias_weight` parameter controls how strongly results favor the focus terms:

In [None]:
# Compare different bias weights
weights = [1.0, 2.0, 5.0, 10.0]

print("Bias Weight Comparison (focus: security, privacy)")
print("=" * 80)

for weight in weights:
    biased = BiasedTextRank(
        focus_terms=["security", "privacy"],
        bias_weight=weight,
        top_n=3,
        language="en"
    )
    result = biased.extract_keywords(tech_article)
    
    print(f"\nbias_weight={weight}:")
    for p in result.phrases:
        print(f"  {p.text}: {p.score:.4f}")

In [None]:
# Compare biased vs unbiased on the same text
base = BaseTextRank(top_n=5, language="en")
biased_security = BiasedTextRank(
    focus_terms=["security", "privacy"],
    bias_weight=5.0,
    top_n=5,
    language="en"
)
biased_perf = BiasedTextRank(
    focus_terms=["performance", "speed", "cache"],
    bias_weight=5.0,
    top_n=5,
    language="en"
)

base_result = base.extract_keywords(tech_article)
security_result = biased_security.extract_keywords(tech_article)
perf_result = biased_perf.extract_keywords(tech_article)

print(f"{'Unbiased':<25} {'Security Focus':<25} {'Performance Focus':<25}")
print("=" * 75)
# zip stops at the shortest list, so short results do not raise IndexError
for b, s, p in zip(base_result.phrases, security_result.phrases, perf_result.phrases):
    print(f"{b.text:<25} {s.text:<25} {p.text:<25}")

### Dynamic Focus Terms

You can also pass `focus_terms` per call, which is useful when processing multiple documents with different topics:

In [None]:
# Create extractor with default focus
extractor = BiasedTextRank(
    focus_terms=["default"],  # Placeholder
    bias_weight=5.0,
    top_n=5,
    language="en"
)

# Override focus_terms per call
topics = [
    ["security", "encryption"],
    ["performance", "caching"],
    ["user", "experience"]
]

for focus in topics:
    result = extractor.extract_keywords(tech_article, focus_terms=focus)
    print(f"\nFocus: {focus}")
    for p in result.phrases[:3]:
        print(f"  - {p.text}")

## JSON Batch API

For processing large volumes of pre-tokenized documents, use the JSON interface:

- `extract_from_json()` - Single document
- `extract_batch_from_json()` - Multiple documents

This is particularly useful when:
- You've already tokenized with spaCy or another NLP library
- You're processing many documents in batch
- You need fine-grained control over tokenization

In [None]:
import json
from rapid_textrank import extract_from_json

# Single document with pre-tokenized input
doc = {
    "tokens": [
        {"text": "Machine", "lemma": "machine", "pos": "NOUN",
         "start": 0, "end": 7, "sentence_idx": 0, "token_idx": 0, "is_stopword": False},
        {"text": "learning", "lemma": "learning", "pos": "NOUN",
         "start": 8, "end": 16, "sentence_idx": 0, "token_idx": 1, "is_stopword": False},
        {"text": "is", "lemma": "be", "pos": "AUX",
         "start": 17, "end": 19, "sentence_idx": 0, "token_idx": 2, "is_stopword": True},
        {"text": "transforming", "lemma": "transform", "pos": "VERB",
         "start": 20, "end": 32, "sentence_idx": 0, "token_idx": 3, "is_stopword": False},
        {"text": "industries", "lemma": "industry", "pos": "NOUN",
         "start": 33, "end": 43, "sentence_idx": 0, "token_idx": 4, "is_stopword": False},
    ],
    "config": {
        "top_n": 5,
        "window_size": 4,
        "damping": 0.85
    }
}

result_json = extract_from_json(json.dumps(doc))
result = json.loads(result_json)

print("Single document result:")
for phrase in result["phrases"]:
    print(f"  {phrase['text']}: {phrase['score']:.4f}")
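
Writing the token dictionaries above by hand is error-prone, mainly because of the character offsets. A small helper can fill in `start`, `end`, and `token_idx` from a list of `(text, lemma, pos)` tuples. `build_tokens` is a hypothetical helper, not part of `rapid_textrank`, and it assumes tokens are separated by exactly one space and belong to a single sentence.

```python
import json


def build_tokens(tagged_words, stopwords=frozenset(), sentence_idx=0):
    """Build token dicts with the same fields as the hand-written example."""
    tokens = []
    offset = 0
    for token_idx, (text, lemma, pos) in enumerate(tagged_words):
        tokens.append({
            "text": text,
            "lemma": lemma,
            "pos": pos,
            "start": offset,
            "end": offset + len(text),
            "sentence_idx": sentence_idx,
            "token_idx": token_idx,
            "is_stopword": text.lower() in stopwords,
        })
        offset += len(text) + 1  # skip the single separating space
    return tokens


tagged = [
    ("Machine", "machine", "NOUN"),
    ("learning", "learning", "NOUN"),
    ("is", "be", "AUX"),
    ("transforming", "transform", "VERB"),
    ("industries", "industry", "NOUN"),
]
tokens = build_tokens(tagged, stopwords={"is"})
payload = json.dumps({"tokens": tokens, "config": {"top_n": 5}})
```

The resulting `payload` string has the same shape as the `json.dumps(doc)` argument in the cell above. With a real NLP pipeline, the offsets, lemmas, and POS tags would come from the tokenizer rather than from this helper.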

In [None]:
from rapid_textrank import extract_batch_from_json

# Batch processing multiple documents
docs = [
    {
        "tokens": [
            {"text": "Deep", "lemma": "deep", "pos": "ADJ",
             "start": 0, "end": 4, "sentence_idx": 0, "token_idx": 0, "is_stopword": False},
            {"text": "learning", "lemma": "learning", "pos": "NOUN",
             "start": 5, "end": 13, "sentence_idx": 0, "token_idx": 1, "is_stopword": False},
            {"text": "models", "lemma": "model", "pos": "NOUN",
             "start": 14, "end": 20, "sentence_idx": 0, "token_idx": 2, "is_stopword": False},
        ],
        "config": {"top_n": 3}
    },
    {
        "tokens": [
            {"text": "Neural", "lemma": "neural", "pos": "ADJ",
             "start": 0, "end": 6, "sentence_idx": 0, "token_idx": 0, "is_stopword": False},
            {"text": "networks", "lemma": "network", "pos": "NOUN",
             "start": 7, "end": 15, "sentence_idx": 0, "token_idx": 1, "is_stopword": False},
            {"text": "process", "lemma": "process", "pos": "VERB",
             "start": 16, "end": 23, "sentence_idx": 0, "token_idx": 2, "is_stopword": False},
            {"text": "data", "lemma": "data", "pos": "NOUN",
             "start": 24, "end": 28, "sentence_idx": 0, "token_idx": 3, "is_stopword": False},
        ],
        "config": {"top_n": 3}
    }
]

results_json = extract_batch_from_json(json.dumps(docs))
results = json.loads(results_json)

print("Batch results:")
for i, result in enumerate(results):
    print(f"\nDocument {i}:")
    for phrase in result["phrases"]:
        print(f"  - {phrase['text']} ({phrase['score']:.4f})")

## Decision Guide: Which Variant to Use?

```
                                START
                                  │
                                  ▼
                    ┌─────────────────────────┐
                    │ Do you have specific    │
                    │ topics to focus on?     │
                    └─────────────────────────┘
                         │              │
                        YES             NO
                         │              │
                         ▼              ▼
             ┌────────────────┐  ┌─────────────────────────┐
             │ BiasedTextRank │  │ Is key info at the      │
             │                │  │ beginning of the doc?   │
             └────────────────┘  └─────────────────────────┘
                                       │              │
                                      YES             NO
                                       │              │
                                       ▼              ▼
                              ┌──────────────┐ ┌──────────────┐
                              │ PositionRank │ │ BaseTextRank │
                              └──────────────┘ └──────────────┘
```
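
The same decision flow can be written as a small function, which is handy when routing documents in a pipeline. `choose_variant` and its two boolean parameters are hypothetical names introduced for this sketch; it returns the class name as a string and does not construct an extractor.

```python
def choose_variant(has_focus_terms: bool, key_info_up_front: bool) -> str:
    """Mirror the decision diagram: focus terms first, then document position."""
    if has_focus_terms:
        return "BiasedTextRank"
    if key_info_up_front:
        return "PositionRank"
    return "BaseTextRank"


print(choose_variant(has_focus_terms=False, key_info_up_front=True))
```

Focus terms take priority over position, matching the order of the questions in the diagram.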

### Recommendations by Document Type

| Document Type | Recommended Variant | Why |
|---------------|---------------------|-----|
| Blog posts, articles | BaseTextRank | General content, no position bias needed |
| Academic papers | PositionRank | Key terms in title/abstract |
| News articles | PositionRank | Lead paragraphs contain key info |
| Product reviews | BiasedTextRank | Focus on features you care about |
| Support tickets | BiasedTextRank | Focus on problem categories |
| Legal documents | BaseTextRank | Important terms throughout |

## Next Steps

- **[03_explain_algorithm.ipynb](03_explain_algorithm.ipynb)** - Visual explanation of how TextRank works internally
- **[04_benchmarks.ipynb](04_benchmarks.ipynb)** - Performance comparison with pytextrank