information_extractor is a tool that leverages spaCy for coreference resolution and SpanBERT for relation extraction. This project integrates named entity recognition (NER) with relation extraction to identify and analyze relationships between entities in text.
- Pre-trained model for relation extraction between entities
- Supports multiple entity types (PERSON, ORGANIZATION, LOCATION, etc.)
- Handles special token markers for subject and object entities
- Uses BERT architecture for sequence classification
- GPU acceleration support when available
- Configurable batch size and sequence length
- Maps between spaCy and SpanBERT entity labels
- Supports common entity types:
  - Organizations (ORG)
  - Persons (PERSON)
  - Locations (GPE, LOC)
  - Dates (DATE)
  - And more
- Creates entity pairs from spaCy sentences
- Handles bidirectional relationships
- Configurable confidence threshold
- Deduplicates relations with confidence scoring
- Returns structured relation tuples
- Detailed logging for debugging
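
The label mapping, bidirectional pairing, and confidence-based deduplication listed above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the `SPACY_TO_SPANBERT` table, the `Entity` type, the `entity_pairs` and `dedupe` helpers, the `(subject, relation, object, confidence)` tuple shape, and the 0.7 threshold are all assumptions made for the example.

```python
from itertools import permutations
from typing import NamedTuple

# Assumed mapping from spaCy NER labels to TACRED-style labels;
# the project's actual table may differ.
SPACY_TO_SPANBERT = {
    "ORG": "ORGANIZATION",
    "PERSON": "PERSON",
    "GPE": "LOCATION",
    "LOC": "LOCATION",
    "DATE": "DATE",
}


class Entity(NamedTuple):
    """Hypothetical stand-in for a spaCy entity span."""
    text: str
    label: str  # spaCy label, e.g. "ORG"


def entity_pairs(entities):
    """Yield ordered (subject, object) pairs so both directions are tried."""
    mapped = [
        (e.text, SPACY_TO_SPANBERT[e.label])
        for e in entities
        if e.label in SPACY_TO_SPANBERT  # skip labels with no mapping
    ]
    yield from permutations(mapped, 2)


def dedupe(predictions, threshold=0.7):
    """Drop low-confidence predictions and keep the best score per triple."""
    best = {}
    for subj, rel, obj, conf in predictions:
        if conf < threshold:
            continue
        key = (subj, rel, obj)
        if key not in best or conf > best[key]:
            best[key] = conf
    return [(s, r, o, c) for (s, r, o), c in best.items()]


# Illustrative inputs; the confidence scores are made up for the example.
entities = [
    Entity("Bill Gates", "PERSON"),
    Entity("Microsoft", "ORG"),
    Entity("first", "ORDINAL"),  # unmapped label, ignored
]
pairs = list(entity_pairs(entities))

predictions = [
    ("Microsoft", "org:founded_by", "Bill Gates", 0.91),
    ("Microsoft", "org:founded_by", "Bill Gates", 0.84),
    ("Bill Gates", "per:employee_of", "Microsoft", 0.42),
]
relations = dedupe(predictions)
```

Generating ordered pairs rather than unordered ones lets the classifier score each entity as both subject and object, which matters for directional relations such as `org:founded_by`.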
The assets directory contains the following pretrained models:
- pretrained_spanbert/: SpanBERT fine-tuned on the TACRED relation extraction dataset
- coreferee_model_en from Stanford research
- en_core_web_md-3.5.0 from spaCy
To install and set up the project, run the following commands:
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/rajatasusual/information_extractor.git
cd information_extractor
pip3 install -r requirements.txt
git lfs pull --include "assets/pretrained_spanbert/pytorch_model.bin"
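Git LFS must be installed before you run these commands, because the model weights are stored as large files. The clone skips LFS downloads (GIT_LFS_SKIP_SMUDGE=1), and the final command then fetches only the SpanBERT weights.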
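
Because the clone skips LFS downloads, the weights file is only a small text pointer until `git lfs pull` has run. The following sketch checks for that case before loading the model. The `is_lfs_pointer` helper is hypothetical and not part of this project; the prefix it looks for is the first line of a standard Git LFS pointer file.

```python
from pathlib import Path

# Opening bytes of a Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file is still an LFS pointer rather than real data."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


weights = Path("assets/pretrained_spanbert/pytorch_model.bin")
if not weights.exists():
    print(f"{weights} is missing; clone the repository first")
elif is_lfs_pointer(weights):
    print(f"{weights} is an LFS pointer; run the git lfs pull command")
else:
    print(f"{weights} looks like real model weights")
```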
To extract relations using spaCy and SpanBERT, you can run the provided example script:
python main.py
The same pipeline can also be driven from Python:
import spacy
from spanbert_module import SpanBERT # Import SpanBERT model
# Load spaCy NLP model
nlp = spacy.load("en_core_web_md")
# Sample text
text = "Bill Gates founded Microsoft. Microsoft is headquartered in Redmond."
# Process text with spaCy
doc = nlp(text)
# Load SpanBERT
pretrained_dir = "assets/pretrained_spanbert"
spanbert = SpanBERT(pretrained_dir=pretrained_dir)
# Extract relations
relations = spanbert.extract_relations(doc)
print(relations)
This project integrates SpanBERT from Facebook Research. If you use this project, please cite:
@article{joshi2019spanbert,
title={{SpanBERT}: Improving Pre-training by Representing and Predicting Spans},
author={Mandar Joshi and Danqi Chen and Yinhan Liu and Daniel S. Weld and Luke Zettlemoyer and Omer Levy},
journal={arXiv preprint arXiv:1907.10529},
year={2019}
}
This project is intended for research and educational purposes. The SpanBERT model belongs to Facebook Research, and its use must comply with their licensing terms. We are not affiliated with Facebook Research.