# rapid_textrank Quick Start

This notebook demonstrates the basics of `rapid_textrank`, a high-performance TextRank implementation in Rust with Python bindings.

**What you'll learn:**
- How to extract keywords from text
- How to read `Phrase` objects and their attributes
- How to configure the algorithm with `TextRankConfig`
- How to extract keywords from non-English text
- How to use the optional spaCy integration
- How to run TopicRank on spaCy tokens

## Installation

Install `rapid_textrank` from PyPI:

In [1]:
%pip install -q rapid_textrank

Note: you may need to restart the kernel to use updated packages.


## Basic Usage

The simplest way to extract keywords is with the `extract_keywords()` function:

In [2]:
from rapid_textrank import extract_keywords

text = """
Machine learning is a subset of artificial intelligence that enables
systems to learn and improve from experience. Deep learning, a type of
machine learning, uses neural networks with many layers.
"""

# Extract the top 5 keywords
keywords = extract_keywords(text, top_n=5, language="en")

for phrase in keywords:
    print(f"{phrase.text}: {phrase.score:.4f}")

neural networks: 0.1375
artificial intelligence: 0.1354
subset: 0.0754
improve: 0.0738
learn: 0.0738


## Understanding Results: The `Phrase` Object

Each keyword is returned as a `Phrase` object with several useful attributes:

In [3]:
# Let's examine the first phrase in detail
phrase = keywords[0]

print("Phrase attributes:")
print(f"  text:  {phrase.text!r:30} # The original phrase text")
print(f"  lemma: {phrase.lemma!r:30} # Lemmatized (base) form")
print(f"  score: {phrase.score:<30.4f} # TextRank importance score")
print(f"  rank:  {phrase.rank:<30} # 1-indexed ranking position")
print(f"  count: {phrase.count:<30} # Number of occurrences in text")

Phrase attributes:
  text:  'neural networks'              # The original phrase text
  lemma: 'neural network'               # Lemmatized (base) form
  score: 0.1375                         # TextRank importance score
  rank:  1                              # 1-indexed ranking position
  count: 1                              # Number of occurrences in text


In [4]:
# View all keywords as a table
print(f"{'Rank':<6} {'Text':<35} {'Score':<10} {'Count':<6}")
print("-" * 60)
for p in keywords:
    print(f"{p.rank:<6} {p.text:<35} {p.score:<10.4f} {p.count:<6}")

Rank   Text                                Score      Count 
------------------------------------------------------------
1      neural networks                     0.1375     1     
2      artificial intelligence             0.1354     1     
3      subset                              0.0754     1     
4      improve                             0.0738     1     
5      learn                               0.0738     1     
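
Because each `Phrase` exposes plain attributes, results are easy to hand to other tools such as a CSV writer or a DataFrame. The sketch below converts phrases into dictionaries. `phrases_to_rows` is a hypothetical helper, not part of `rapid_textrank`, and the `SimpleNamespace` object is a stand-in carrying the values printed above so the block runs without the library installed.

```python
from types import SimpleNamespace

# Attribute names taken from the Phrase printout above.
FIELDS = ("rank", "text", "lemma", "score", "count")


def phrases_to_rows(phrases):
    """Turn objects exposing the Phrase attributes into plain dicts."""
    return [{field: getattr(p, field) for field in FIELDS} for p in phrases]


# Stand-in for keywords[0]; with the library installed, pass the list
# returned by extract_keywords() instead.
demo = [
    SimpleNamespace(
        rank=1,
        text="neural networks",
        lemma="neural network",
        score=0.1375,
        count=1,
    )
]

rows = phrases_to_rows(demo)
print(rows[0])
```

Keeping the field list in one tuple means the same helper can feed `csv.DictWriter` or `json.dumps` without further changes.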


## Configuration with `TextRankConfig`

For more control over the algorithm, use `TextRankConfig` with the class-based API:

In [5]:
from rapid_textrank import TextRankConfig, BaseTextRank

# Create a custom configuration (only overriding a few defaults)
config = TextRankConfig(
    top_n=10,
    score_aggregation="sum",
    language="en",
)

# Create an extractor with the config
extractor = BaseTextRank(config=config)

# Extract keywords
result = extractor.extract_keywords(text)

print(f"Converged: {result.converged}")
print(f"Iterations: {result.iterations}")
print("Top phrases:")
for p in result.phrases[:5]:
    print(f"  {p.rank}. {p.text}: {p.score:.4f}")


Converged: True
Iterations: 23
Top phrases:
  1. neural networks: 0.1375
  2. artificial intelligence: 0.1354
  3. subset: 0.0754
  4. improve: 0.0738
  5. learn: 0.0738


### Key Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `damping` | 0.85 | PageRank damping factor. Higher = more weight to graph structure |
| `window_size` | 4 | Co-occurrence window. Larger = more connections between distant words |
| `min_phrase_length` | 1 | Minimum words in a keyphrase (1 allows single words) |
| `max_phrase_length` | 4 | Maximum words in a keyphrase |
| `score_aggregation` | "sum" | How to combine word scores: sum, mean, max, or rms |
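
To make `damping` and `window_size` more concrete, here is a minimal pure-Python sketch of the generic TextRank idea: build a co-occurrence graph over content words, then run damped PageRank until the scores stop changing. It is an illustration under stated assumptions, not `rapid_textrank`'s Rust implementation. The function names, the convergence tolerance, and the convention that the window counts the current token are all assumptions made for this example.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window_size=4):
    """Link each token to the tokens that follow it inside the window.

    Assumption: the window includes the current token, so each token is
    linked to at most window_size - 1 successors.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window_size, len(tokens))):
            other = tokens[j]
            if other == word:
                continue  # no self-loops
            graph[word][other] += 1.0
            graph[other][word] += 1.0
    return graph


def pagerank(graph, damping=0.85, tol=1e-6, max_iter=100):
    """Weighted PageRank; returns (scores, iterations, converged)."""
    nodes = list(graph)
    if not nodes:
        return {}, 0, True
    n = len(nodes)
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    scores = {node: 1.0 / n for node in nodes}
    for iteration in range(1, max_iter + 1):
        new_scores = {}
        for node in nodes:
            # Each neighbour passes on a share of its score proportional
            # to the weight of the edge it has with this node.
            incoming = sum(
                scores[nb] * weight / out_weight[nb]
                for nb, weight in graph[node].items()
            )
            new_scores[node] = (1.0 - damping) / n + damping * incoming
        delta = sum(abs(new_scores[x] - scores[x]) for x in nodes)
        scores = new_scores
        if delta < tol:
            return scores, iteration, True
    return scores, max_iter, False


# Content words from the English example text, stopwords already removed by hand.
tokens = [
    "machine", "learning", "subset", "artificial", "intelligence",
    "systems", "learn", "improve", "experience", "deep", "learning",
    "machine", "learning", "neural", "networks", "layers",
]
graph = cooccurrence_graph(tokens, window_size=4)
scores, iterations, converged = pagerank(graph, damping=0.85)

print(f"Converged: {converged} after {iterations} iterations")
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"  {word}: {score:.4f}")
```

In this sketch a larger `window_size` adds edges between words that sit further apart, and a higher `damping` lets the graph term outweigh the uniform `(1 - damping) / n` baseline.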

In [6]:
# Compare different score aggregation methods
aggregation_methods = ["sum", "mean", "max", "rms"]

print("Score Aggregation Comparison (top 3 phrases):")
print("=" * 60)

for method in aggregation_methods:
    config = TextRankConfig(score_aggregation=method, top_n=3, language="en")
    extractor = BaseTextRank(config=config)
    result = extractor.extract_keywords(text)
    
    print(f"\n{method.upper()}:")
    for p in result.phrases:
        print(f"  {p.text}: {p.score:.4f}")

Score Aggregation Comparison (top 3 phrases):
============================================================

SUM:
  neural networks: 0.1375
  artificial intelligence: 0.1354
  subset: 0.0754

MEAN:
  subset: 0.0754
  improve: 0.0738
  learn: 0.0738

MAX:
  neural networks: 0.0778
  artificial intelligence: 0.0759
  subset: 0.0754

RMS:
  subset: 0.0754
  improve: 0.0738
  learn: 0.0738
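
The ranking flips between methods because `sum` rewards longer phrases, while `mean` and `rms` normalize by phrase length. The sketch below reproduces the arithmetic for "neural networks". The two word scores are inferred from the SUM (0.1375) and MAX (0.0778) outputs above, so they are an assumption rather than values reported by the library, and `aggregate` is a hypothetical helper written for this illustration.

```python
import math


def aggregate(word_scores, method="sum"):
    """Combine per-word scores into one phrase score."""
    if method == "sum":
        return sum(word_scores)
    if method == "mean":
        return sum(word_scores) / len(word_scores)
    if method == "max":
        return max(word_scores)
    if method == "rms":
        return math.sqrt(sum(s * s for s in word_scores) / len(word_scores))
    raise ValueError(f"unknown aggregation method: {method!r}")


# Assumed word scores for "neural networks": max is 0.0778, and the
# remainder of the 0.1375 sum is 0.0597.
neural_networks = [0.0778, 0.0597]
subset = [0.0754]  # single-word phrase, identical under every method

for method in ("sum", "mean", "max", "rms"):
    two_word = aggregate(neural_networks, method)
    one_word = aggregate(subset, method)
    winner = "neural networks" if two_word > one_word else "subset"
    print(f"{method:<5} neural networks={two_word:.4f} subset={one_word:.4f} -> {winner}")
```

Under these assumed scores, the two-word phrase wins with `sum` and `max` but falls below "subset" with `mean` and `rms`, which matches the ordering in the output above.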


## Multi-Language Support

`rapid_textrank` supports stopword filtering for 18 languages:

| Code | Language | Code | Language | Code | Language |
|------|----------|------|----------|------|----------|
| `en` | English | `de` | German | `fr` | French |
| `es` | Spanish | `it` | Italian | `pt` | Portuguese |
| `nl` | Dutch | `ru` | Russian | `sv` | Swedish |
| `no` | Norwegian | `da` | Danish | `fi` | Finnish |
| `hu` | Hungarian | `tr` | Turkish | `pl` | Polish |
| `ar` | Arabic | `zh` | Chinese | `ja` | Japanese |
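
One way to picture what stopword filtering does is to treat stopwords as separators: runs of consecutive non-stopwords become candidate phrases. The sketch below illustrates that idea on a German sentence. It is a simplified model, not the library's actual candidate selection, and both `candidate_phrases` and the three-word stopword set are hypothetical stand-ins chosen for the example.

```python
def candidate_phrases(tokens, stopwords):
    """Split a token sequence on stopwords and join each remaining run."""
    phrases, current = [], []
    for token in tokens:
        if token.lower() in stopwords:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:  # flush the final run
        phrases.append(" ".join(current))
    return phrases


# Tiny illustrative stopword set; a real German list is much longer.
german_stopwords = {"ist", "ein", "der"}
tokens = [
    "Maschinelles", "Lernen", "ist", "ein",
    "Teilgebiet", "der", "künstlichen", "Intelligenz",
]

print(candidate_phrases(tokens, german_stopwords))
```

Passing the wrong language code would leave function words such as "ist" or "der" inside the runs, which is why the `language` argument matters for phrase quality.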

In [7]:
# German example
german_text = """
Maschinelles Lernen ist ein Teilgebiet der künstlichen Intelligenz.
Deep Learning verwendet neuronale Netze mit vielen Schichten.
Diese Technologie ermöglicht es Computern, aus Erfahrung zu lernen.
"""

keywords_de = extract_keywords(german_text, top_n=5, language="de")

print("German keywords:")
for p in keywords_de:
    print(f"  {p.rank}. {p.text}: {p.score:.4f}")

German keywords:
  1. Learning verwendet neuronale Netze: 0.2865
  2. Maschinelles Lernen: 0.1130
  3. Technologie ermöglicht: 0.1130
  4. künstlichen Intelligenz: 0.1130
  5. Computern: 0.0866


In [8]:
# French example
french_text = """
L'apprentissage automatique est une branche de l'intelligence artificielle.
Les réseaux de neurones profonds permettent l'analyse de données complexes.
Ces technologies transforment de nombreux secteurs industriels.
"""

keywords_fr = extract_keywords(french_text, top_n=5, language="fr")

print("French keywords:")
for p in keywords_fr:
    print(f"  {p.rank}. {p.text}: {p.score:.4f}")

French keywords:
  1. neurones profonds permettent l'analyse: 0.2914
  2. secteurs industriels: 0.1250
  3. technologies transforment: 0.1250
  4. L'apprentissage automatique: 0.1130
  5. l'intelligence artificielle: 0.1130


## Optional: spaCy Integration

If you use spaCy, `rapid_textrank` can integrate as a pipeline component. This lets you pair spaCy's tokenization with `rapid_textrank`'s fast extraction.

First, install the required dependencies:

In [9]:
# Install spaCy and the small English model (skip if already installed)
%pip install -q spacy
import sys
!{sys.executable} -m spacy download en_core_web_sm

Note: you may need to restart the kernel to use updated packages.
Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [10]:
# spaCy integration example
try:
    import spacy
    import rapid_textrank.spacy_component  # Registers the pipeline factory
    
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("rapid_textrank")
    
    doc = nlp(text)
    
    print("Keywords via spaCy pipeline:")
    for phrase in doc._.phrases[:5]:
        print(f"  {phrase.rank}. {phrase.text}: {phrase.score:.4f}")
        
except ImportError:
    print("spaCy not installed. This is optional - rapid_textrank works great on its own!")
except OSError:
    import sys
    print(f"spaCy model not found. Run: {sys.executable} -m spacy download en_core_web_sm")

Keywords via spaCy pipeline:
  1. machine learning: 0.2156
  2. Deep learning: 0.1674
  3. artificial intelligence: 0.1228
  4. neural networks: 0.1167
  5. systems: 0.0670


## TopicRank (Topic Clustering)

TopicRank groups similar keyphrases into topics and ranks the topics instead of individual phrases.
For an apples-to-apples comparison with pytextrank, this example tokenizes with spaCy and passes the tokens to `rapid_textrank` through `extract_from_json`.
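
The grouping step can be sketched in a few lines. The version below greedily merges phrases whose word sets overlap above a threshold. It is a simplified stand-in for illustration only: `word_overlap`, `group_topics`, and the 0.5 threshold are assumptions made here, not the clustering that `rapid_textrank` performs, and the sketch stops at grouping without ranking the topics.

```python
def word_overlap(a, b):
    """Jaccard overlap between the lowercased word sets of two phrases."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


def group_topics(phrases, threshold=0.5):
    """Greedily add each phrase to the first topic with a similar member."""
    topics = []
    for phrase in phrases:
        for topic in topics:
            if any(word_overlap(phrase, member) >= threshold for member in topic):
                topic.append(phrase)
                break
        else:
            # No existing topic was similar enough, so start a new one.
            topics.append([phrase])
    return topics


phrases = [
    "machine learning",
    "deep learning",
    "neural networks",
    "artificial intelligence",
    "Machine learning",
]

for topic in group_topics(phrases):
    print(topic)
```

With this threshold the two spellings of "machine learning" fall into one topic, while "deep learning" shares only one of three distinct words with it and stays separate.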


In [11]:
# TopicRank using spaCy tokens (shared tokenization)
try:
    import json
    import spacy
    from rapid_textrank import extract_from_json

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)

    tokens = []
    for sent_idx, sent in enumerate(doc.sents):
        for token in sent:
            tokens.append({
                'text': token.text,
                'lemma': token.lemma_,
                'pos': token.pos_,
                'start': token.idx,
                'end': token.idx + len(token.text),
                'sentence_idx': sent_idx,
                'token_idx': token.i,
                'is_stopword': token.is_stop,
            })

    payload = {
        'tokens': tokens,
        'variant': 'topic_rank',
        'config': {
            'top_n': 5,
            'language': 'en',
        },
    }

    result = json.loads(extract_from_json(json.dumps(payload)))

    print('TopicRank phrases (rapid_textrank):')
    for p in result['phrases'][:5]:
        print(f"  {p['rank']}. {p['text']}: {p['score']:.4f}")

except ImportError:
    print('spaCy not installed. Install it to run this example.')
except OSError:
    import sys
    print(f'spaCy model not found. Run: {sys.executable} -m spacy download en_core_web_sm')


TopicRank phrases (rapid_textrank):
  1. Machine learning: 0.4312
  2. Deep learning: 0.1674
  3. artificial intelligence: 0.1228
  4. neural networks: 0.1167
  5. systems: 0.0670


## Next Steps

Now that you know the basics, explore more in the other notebooks:

1. **[02_algorithm_variants.ipynb](02_algorithm_variants.ipynb)** - Deep dive into BaseTextRank, PositionRank, and BiasedTextRank
2. **[03_explain_algorithm.ipynb](03_explain_algorithm.ipynb)** - Visual explanation of how TextRank works
3. **[04_benchmarks.ipynb](04_benchmarks.ipynb)** - Performance comparison with pytextrank

### Resources

- [GitHub Repository](https://github.com/xang1234/rapid-textrank)
- [Original TextRank Paper](https://aclanthology.org/W04-3252/) (Mihalcea & Tarau, 2004)