textbook keyword extraction evaluation

this project evaluates keybert keyword extraction by comparing the extracted keywords against a textbook's index. the goal is to measure how closely automated keyword extraction matches the human-curated index terms.

project structure

.
├── README.md
├── requirements.txt
├── jupyter_trainer.py          # primary implementation for training/evaluation
├── keybert_trainer.py          # deprecated implementation (kept for reference)
├── finetune/                   # code for fine-tuning the sentence transformer model
│   ├── config.py               # fine-tuning configuration settings
│   ├── data_prep.py            # data preparation for fine-tuning
│   └── train.py                # main fine-tuning script
├── mindmap_generator.py        # creates interactive knowledge graph visualizations
├── index_mindmap_generator.py  # creates ground truth mindmaps from the index
├── textbook/
│   ├── ch1.txt - ch19.txt      # textbook chapters
│   ├── index.txt               # complete textbook index
│   └── index_by_chapter.txt    # index organized by chapter
├── results/                    # evaluation results directory
│   ├── mindmaps/               # interactive html mindmaps
│   └── *.png, *.json           # final results and visualizations
├── cache/                      # cached data for faster processing
├── models/                     # directory for saving fine-tuned models
└── checkpoints/                # interim directory during training

setup

create a virtual environment:

python3 -m venv venv
source venv/bin/activate

install dependencies:

pip install -r requirements.txt

running the scripts

keyword extraction and evaluation

# Extract keywords with optimized parameters
python jupyter_trainer.py

configurable parameters in jupyter_trainer.py:

top_n: number of keywords to extract (default: 75)
diversity: diversity factor for keyword selection (default: 0.6)
EVAL_SIMILARITY_THRESHOLD: threshold for keyword matching (default: 0.75)
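
to illustrate what top_n and diversity control, here is a minimal pure-python sketch of maximal marginal relevance (MMR), the selection scheme keybert applies when extract_keywords is called with use_mmr=True. whether jupyter_trainer.py enables MMR is an assumption, and the function names and toy 2-d vectors below are hypothetical stand-ins for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(doc_vec, candidates, top_n=75, diversity=0.6):
    """Pick up to top_n keywords, trading relevance against redundancy.

    candidates maps keyword -> embedding (hypothetical toy vectors here).
    diversity=0 ranks purely by relevance to the document; higher values
    penalize candidates that resemble keywords already selected.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:

        def mmr_score(keyword):
            relevance = cosine(remaining[keyword], doc_vec)
            redundancy = max(
                (cosine(remaining[keyword], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy data: two near-duplicate phrases and one distinct phrase.
doc_vec = [1.0, 0.3]
candidates = {
    "neural network": [1.0, 0.0],
    "neural networks": [0.99, 0.05],
    "gradient descent": [0.3, 0.9],
}
print(mmr_select(doc_vec, candidates, top_n=2, diversity=0.0))
print(mmr_select(doc_vec, candidates, top_n=2, diversity=0.6))
```

with diversity set to 0 the two near-duplicates are both kept; at 0.6 the second slot goes to the distinct phrase instead, which is the redundancy reduction the parameter is meant to provide.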

the script will:

process all training chapters (1-5, 7-9, 13-19)
extract keywords using the fine-tuned model (mpnet_textbook_tuned)
combine neural (transformer) and statistical (TF-IDF) scores to rank candidates
evaluate against test chapters (6, 10, 11, 12)
save results to the results directory
cache processed data and embeddings for faster subsequent runs
generate a metrics plot alongside the JSON results

Note: the defaults in the code may differ from the values documented here as we continue to tune them. jupyter_trainer.py is our current best approach; keybert_trainer.py is deprecated and kept for reference only.

fine-tuning sentence transformers

to further improve keyword extraction, you can fine-tune the underlying sentence transformer model on your textbook content:

# Fine-tune the sentence transformer model using triplet loss
python -m finetune.train

# After fine-tuning, the model will be available at:
# models/mpnet_textbook_tuned

the fine-tuning process:

processes all training chapters (1-5, 7-9, 13-19).
locates ground truth keywords in paragraph contexts.
creates triplet examples (context, positive keyword, negative keyword).
trains the model using triplet loss to learn domain-specific relationships.
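
the triplet construction step above can be sketched as follows. this is an illustration under stated assumptions, not the code in finetune/data_prep.py: the function name build_triplets, the substring-based keyword lookup, and the random choice of negatives are all hypothetical, and the sample paragraphs are made up.

```python
import random


def build_triplets(paragraphs, index_keywords, seed=0):
    """Create (context, positive keyword, negative keyword) triplets.

    A keyword is a positive for a paragraph when it occurs in the text
    (case-insensitive substring check); a negative is drawn at random
    from the keywords that do not occur in that paragraph.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    triplets = []
    for paragraph in paragraphs:
        text = paragraph.lower()
        positives = [kw for kw in index_keywords if kw.lower() in text]
        negatives = [kw for kw in index_keywords if kw.lower() not in text]
        if not negatives:
            continue  # no contrast available for this paragraph
        for positive in positives:
            triplets.append((paragraph, positive, rng.choice(negatives)))
    return triplets


# Made-up sample data for illustration only.
paragraphs = [
    "A binary search tree keeps its keys in sorted order.",
    "Hash tables map keys to buckets using a hash function.",
]
index_keywords = ["binary search tree", "hash tables", "heap"]

for context, positive, negative in build_triplets(paragraphs, index_keywords):
    print(f"{positive!r} vs {negative!r} <- {context}")
```

triplets in this shape are what a triplet loss consumes: the model is pushed to embed the context closer to the positive keyword than to the negative one.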

configurable parameters (in finetune/config.py):

BASE_MODEL_NAME: base model to fine-tune (default: 'sentence-transformers/all-mpnet-base-v2')
BATCH_SIZE: batch size for training (default: 16)
EPOCHS: number of training epochs (default: 3)
LEARNING_RATE: learning rate for fine-tuning (default: 2e-5)

requirements:

pytorch and the accelerate library
a gpu is recommended for faster training

generating knowledge graph visualizations

after extracting keywords, you can create interactive mindmaps that visualize the relationships between keywords:

# Generate mindmap from extracted keywords
python mindmap_generator.py --chapters 6 10 11 12 --results results/evaluation_TIMESTAMP.json --output results/mindmaps/keybert_mindmap.html

# Generate ground truth mindmap from index
python index_mindmap_generator.py --chapters 6 10 11 12 --output results/mindmaps/ground_truth_mindmap.html

options:

--chapters: chapter numbers to include in the mindmap (e.g., --chapters 6 10 11 12)
--results: path to the evaluation results file (generated by jupyter_trainer.py)
--model: sentence transformer model to use (default: all-mpnet-base-v2)
--similarity: threshold for connecting keywords (0-1) (default: 0.65)
--max_keywords: maximum number of keywords to include (default: 150)
--min_edge_weight: minimum weight for edges (default: 0.6)
--output: output html file path
--title: title for the visualization
--no_cache: disable embedding cache
--quiet: suppress verbose output

how it works:

loads extracted keywords from evaluation results
creates embeddings for each keyword using sentence transformers
builds a knowledge graph where:
- nodes = keywords
- edges = semantic relationships (based on cosine similarity)
- chapters = hub nodes (diamonds) with distinct colors
organizes keywords around the hub node of their chapter
exports an interactive html visualization using pyvis
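
the graph-building steps above can be sketched without any plotting library. this is a simplified illustration, not the code in mindmap_generator.py: the build_graph function, the node and edge attribute names, and the toy 2-d embeddings are hypothetical, and the pyvis export is left out.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_graph(keywords_by_chapter, embeddings, similarity=0.65):
    """Return (nodes, edges) for a chapter-organized keyword graph.

    Each chapter becomes a diamond hub node linked to its keywords by
    dashed edges; keyword pairs whose cosine similarity reaches the
    threshold get a weighted semantic edge.
    """
    nodes, edges = {}, []
    for chapter, keywords in keywords_by_chapter.items():
        hub = f"chapter {chapter}"
        nodes[hub] = {"shape": "diamond", "group": chapter}
        for keyword in keywords:
            nodes[keyword] = {"shape": "dot", "group": chapter}
            edges.append((hub, keyword, {"dashes": True}))

    all_keywords = [kw for kws in keywords_by_chapter.values() for kw in kws]
    for a, b in combinations(all_keywords, 2):
        score = cosine(embeddings[a], embeddings[b])
        if score >= similarity:
            edges.append((a, b, {"weight": round(score, 3)}))
    return nodes, edges


# Toy embeddings standing in for sentence-transformer vectors.
keywords_by_chapter = {6: ["stack", "queue"], 10: ["graph"]}
embeddings = {"stack": [1.0, 0.0], "queue": [0.9, 0.1], "graph": [0.0, 1.0]}

nodes, edges = build_graph(keywords_by_chapter, embeddings)
for edge in edges:
    print(edge)
```

raising the similarity threshold removes weaker semantic edges, which is the effect the --similarity option describes.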

the visualization allows:

interactive exploration of keyword relationships
zooming and panning
color-coded keywords by chapter
hover information with details
physics-based layout where related keywords naturally group together

implementation details

keyword extraction

uses keybert with a sentence-transformer model:
- fine-tuned mpnet_textbook_tuned model (falls back to all-mpnet-base-v2 if not available)
extracts 75-100 keywords per chapter (configurable)
uses n-grams (1-3 words) to capture phrases
implements diversity in keyword selection (0.6-0.7)
filters out very short keywords (≤2 characters)
sentence chunking: splits text into sentence-based chunks using a regex-based splitter
hybrid approach: combines neural (transformer) and statistical (TF-IDF) methods:
- filters candidates using is_valid_keyword before merging
- combines scores with customizable weighting
paragraph-aware processing: treats paragraphs as separate documents for TF-IDF calculation
embedding caching: caches keyword embeddings to improve performance
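
the hybrid step can be sketched as a weighted merge of two score dictionaries. this is a minimal illustration under assumptions: is_valid_keyword is named in this README but its body is not shown, so the version below (length and alphabetic checks) is a guess, and combine_scores, the max-normalization, and the 0.7 neural weight are hypothetical.

```python
def is_valid_keyword(keyword, min_length=3):
    """Hypothetical filter: drop keywords of 2 characters or fewer
    and candidates that contain no letters at all."""
    keyword = keyword.strip()
    return len(keyword) >= min_length and any(c.isalpha() for c in keyword)


def normalize(scores):
    """Scale scores to the 0-1 range by dividing by the maximum."""
    if not scores:
        return {}
    top = max(scores.values())
    return {kw: (s / top if top > 0 else 0.0) for kw, s in scores.items()}


def combine_scores(neural, statistical, neural_weight=0.7, top_n=75):
    """Filter both candidate sets, then merge them with a weighted sum.

    A keyword missing from one source contributes 0 from that source.
    """
    neural = normalize({k: v for k, v in neural.items() if is_valid_keyword(k)})
    statistical = normalize(
        {k: v for k, v in statistical.items() if is_valid_keyword(k)}
    )
    combined = {
        kw: neural_weight * neural.get(kw, 0.0)
        + (1 - neural_weight) * statistical.get(kw, 0.0)
        for kw in set(neural) | set(statistical)
    }
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Made-up scores: cosine-like values for neural, raw TF-IDF for statistical.
neural = {"binary tree": 0.8, "of": 0.9, "hash table": 0.4}
statistical = {"binary tree": 2.0, "linked list": 4.0, "x1": 5.0}

for keyword, score in combine_scores(neural, statistical):
    print(f"{score:.2f}  {keyword}")
```

normalizing each source first keeps raw TF-IDF values from swamping the cosine-based scores; a neural weight above 0.5 gives priority to the transformer results.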

evaluation metrics

precision: ratio of correctly identified keywords to total predicted keywords
recall: ratio of correctly identified keywords to total actual keywords
f1 score: harmonic mean of precision and recall
ground truth: evaluates test chapter keywords against actual index keywords for that chapter (loaded from index_by_chapter.txt)
similarity threshold: 0.75 (optimized through experimentation)
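
a minimal sketch of how these metrics can be computed with threshold-based matching. the evaluate function is hypothetical, and difflib's string ratio stands in for the embedding similarity so that the example runs with the standard library only. counting matched predictions for precision and matched index terms for recall is one reasonable reading of "correctly identified", not necessarily the one used in jupyter_trainer.py.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in similarity in the 0-1 range (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def evaluate(predicted, ground_truth, threshold=0.75, similarity=string_similarity):
    """Compute precision, recall, and F1 with fuzzy keyword matching.

    A predicted keyword counts as correct when it matches any index term
    at or above the threshold; an index term counts as found when any
    predicted keyword matches it.
    """
    matched_predicted = sum(
        1 for p in predicted
        if any(similarity(p, g) >= threshold for g in ground_truth)
    )
    matched_truth = sum(
        1 for g in ground_truth
        if any(similarity(p, g) >= threshold for p in predicted)
    )
    precision = matched_predicted / len(predicted) if predicted else 0.0
    recall = matched_truth / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up keywords for illustration.
predicted = ["binary tree", "hash table", "compiler"]
ground_truth = ["binary trees", "hash tables", "linked list", "recursion"]
print(evaluate(predicted, ground_truth))
```

passing a different similarity callable, for example one backed by sentence-transformer embeddings, leaves the metric logic unchanged.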

keyword matching

the evaluation matches keywords using three complementary methods:

embedding-based similarity:
- uses transformer model to create vector representations
- calculates cosine similarity between keyword vectors
- captures semantic relationships between terms
- better handles synonyms and related concepts
- configurable similarity threshold (default: 0.75)
string-based matching:
- normalization (lowercase, remove special characters)
- exact matches after normalization (score: 1.0)
- contained terms (score: 0.9)
- fuzzy string matching using sequence matcher
hierarchical matching:
- specialized matching for hierarchical terms in indices
- handles variations in term order and phrasing
- combines string and semantic methods
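
the string-based tier can be sketched with the standard library. the scores 1.0 and 0.9 come from the list above; the function names, the exact normalization regex, and the handling of inverted index entries such as "trees, binary" are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase, replace special characters with spaces, collapse whitespace."""
    term = re.sub(r"[^a-z0-9\s]", " ", term.lower())
    return " ".join(term.split())


def string_match_score(predicted, truth):
    """Score a pair: exact match 1.0, containment 0.9, else fuzzy ratio."""
    a, b = normalize(predicted), normalize(truth)
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0
    if a in b or b in a:
        return 0.9
    return SequenceMatcher(None, a, b).ratio()


def hierarchical_match_score(predicted, index_entry):
    """Also try the reordered form of an inverted index entry.

    Assumes hierarchical entries are comma-separated, e.g. "trees, binary"
    is additionally compared as "binary trees".
    """
    variants = [index_entry]
    if "," in index_entry:
        parts = [part.strip() for part in index_entry.split(",")]
        variants.append(" ".join(reversed(parts)))
    return max(string_match_score(predicted, v) for v in variants)


print(string_match_score("Hash-Table", "hash table"))
print(string_match_score("tree", "binary tree"))
print(hierarchical_match_score("binary trees", "trees, binary"))
```

a pair would then count as a match when its best score reaches the similarity threshold. the containment rule is deliberately crude here and would accept very short substrings, which is one reason to filter short keywords first.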

visualization features

chapter-based organization:
- keywords organized around chapter hub nodes
- each chapter has a distinct color
- chapter nodes shown as diamonds with larger font
- keywords connected to their chapter with dashed lines
index-based ground truth:
- index_mindmap_generator.py creates ground truth mindmaps
- visualizes index terms organized by chapter
- allows direct comparison with extracted keywords
semantic connections:
- keywords connected based on semantic similarity
- similar concepts cluster together visually
- edge thickness indicates strength of relationship

training vs testing split

the evaluation is designed to work with specific chapter splits:

training chapters: 1, 2, 3, 4, 5, 7, 8, 9, 13, 14, 15, 16, 17, 18, 19
test chapters: 6, 10, 11, 12

this split allows evaluating the model on both seen content (the training chapters) and unseen content (the test chapters).

implemented improvements

optimized extraction volume:
- extracts 75-100 keywords per chapter (tuned for precision/recall balance)
- keeps up to 300 unique training keywords
- better coverage of relevant terms
enhanced diversity:
- diversity parameter tuned between 0.6-0.7
- captures a wider range of concepts
- reduces redundancy in extracted terms
length filtering:
- filters out very short keywords (≤2 characters)
- improves precision by removing common short words
- reduces false positives
optimized similarity matching:
- similarity threshold tuned to 0.75
- balances precision and recall
- better identifies related concepts
chapter-based visualization:
- organizes mindmaps by chapter instead of clusters
- provides clearer structure and context
- makes comparison between extracted and index terms easier
ground truth visualization:
- added index-based mindmap generation
- provides baseline for comparison
- helps assess extraction quality visually
improved extraction pipeline:
- switched to sentence-based chunking for better semantic coherence
- tuned hybrid extraction to filter noise and prioritize neural results
- enhanced stopword list to remove common non-keywords
model fine-tuning:
- implemented capability to fine-tune the sentence transformer model on textbook content
- uses triplet loss to learn domain-specific relationships between context and keywords
- created a custom model that better understands the textbook's terminology and style
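
as an illustration of the sentence-based chunking mentioned above, here is a minimal regex-based sketch. the pattern, the function names, and the chunk size of three sentences are assumptions; the splitter used in jupyter_trainer.py may handle abbreviations and other edge cases differently.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences, dropping empty fragments."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


def chunk_sentences(text, max_sentences=3):
    """Group consecutive sentences into chunks of at most max_sentences."""
    sentences = split_sentences(text)
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


sample = "A stack is LIFO. A queue is FIFO! Is a deque both? Yes."
for chunk in chunk_sentences(sample):
    print(chunk)
```

chunks built this way never cut a sentence in half, so each one stays semantically coherent when it is embedded.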
