information_extractor

Overview

information_extractor uses spaCy for coreference resolution and SpanBERT for relation extraction. It combines named entity recognition (NER) with relation extraction to identify and analyze relationships between entities in text.

Features

SpanBERT Model

- Pre-trained model for relation extraction between entities
- Supports multiple entity types (PERSON, ORGANIZATION, LOCATION, etc.)
- Handles special token markers for subject and object entities
- Uses BERT architecture for sequence classification
- GPU acceleration support when available
- Configurable batch size and sequence length
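
The special-token handling can be pictured as wrapping the subject and object spans in marker tokens before the sequence reaches the classifier. The sketch below is only an illustration of that idea, not the project's actual code: the function name `mark_entities` and the marker strings (`[SUBJ-START]`, `[SUBJ-END]`, `[OBJ-START]`, `[OBJ-END]`) are assumptions chosen for readability, and the real model may use different tokens.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def mark_entities(tokens: List[str], subj: Span, obj: Span) -> List[str]:
    """Return a copy of `tokens` with marker tokens around subject and object.

    The marker strings are hypothetical placeholders. The two spans are
    assumed not to overlap.
    """
    opening = {subj[0]: "[SUBJ-START]", obj[0]: "[OBJ-START]"}
    closing = {subj[1]: "[SUBJ-END]", obj[1]: "[OBJ-END]"}

    marked: List[str] = []
    for i, token in enumerate(tokens):
        if i in closing:  # close a span that ended just before this token
            marked.append(closing[i])
        if i in opening:  # open a span that starts at this token
            marked.append(opening[i])
        marked.append(token)
    if len(tokens) in closing:  # a span that runs to the end of the sentence
        marked.append(closing[len(tokens)])
    return marked


if __name__ == "__main__":
    sentence = ["Bill", "Gates", "founded", "Microsoft", "."]
    print(mark_entities(sentence, subj=(0, 2), obj=(3, 4)))
```

Keeping the markers as separate tokens means the classifier can locate both entities without any change to the sentence's original tokens.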

Entity Processing

- Maps between spaCy and SpanBERT entity labels
- Supports common entity types:
  - Organizations (ORG)
  - Persons (PERSON)
  - Locations (GPE, LOC)
  - Dates (DATE)
  - And more
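
A label mapping of this kind can be sketched as a plain dictionary. The entries below are an assumption built from the entity types named in this README (spaCy's `ORG`, `PERSON`, `GPE`, `LOC`, `DATE` on one side, and SpanBERT-style `ORGANIZATION`, `PERSON`, `LOCATION`, `DATE` on the other); the names `SPACY_TO_SPANBERT`, `SPANBERT_TO_SPACY`, and `to_spanbert_label` are hypothetical and may not match the project's code.

```python
from typing import Dict, List, Optional

# Assumed mapping from spaCy entity labels to SpanBERT-style labels.
SPACY_TO_SPANBERT: Dict[str, str] = {
    "ORG": "ORGANIZATION",
    "PERSON": "PERSON",
    "GPE": "LOCATION",
    "LOC": "LOCATION",
    "DATE": "DATE",
}

# Reverse mapping; one SpanBERT label can correspond to several spaCy labels.
SPANBERT_TO_SPACY: Dict[str, List[str]] = {}
for spacy_label, bert_label in SPACY_TO_SPANBERT.items():
    SPANBERT_TO_SPACY.setdefault(bert_label, []).append(spacy_label)


def to_spanbert_label(spacy_label: str) -> Optional[str]:
    """Translate a spaCy label, returning None for labels with no mapping."""
    return SPACY_TO_SPANBERT.get(spacy_label)


if __name__ == "__main__":
    print(to_spanbert_label("GPE"))
    print(SPANBERT_TO_SPACY["LOCATION"])
```

Returning `None` for unmapped labels lets the caller skip entities the relation model was not trained on instead of raising an error.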

Relation Extraction

- Creates entity pairs from spaCy sentences
- Handles bidirectional relationships
- Configurable confidence threshold
- Deduplicates relations with confidence scoring
- Returns structured relation tuples
- Detailed logging for debugging
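
The pairing, thresholding, and deduplication steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions rather than the project's implementation: the helper names `candidate_pairs` and `dedupe_relations`, the `(subject, relation, object, confidence)` tuple layout, the default threshold of 0.7, and the relation labels in the demo are all illustrative.

```python
from itertools import permutations
from typing import Dict, Iterable, List, Tuple

Entity = Tuple[str, str]                # (text, label)
Relation = Tuple[str, str, str, float]  # (subject, relation, object, confidence)


def candidate_pairs(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Build ordered (subject, object) pairs so both directions are tried."""
    return [(a, b) for a, b in permutations(entities, 2) if a != b]


def dedupe_relations(
    predictions: Iterable[Relation], threshold: float = 0.7
) -> List[Relation]:
    """Drop predictions below `threshold` and keep the best score per triple."""
    best: Dict[Tuple[str, str, str], float] = {}
    for subj, rel, obj, conf in predictions:
        if conf < threshold:
            continue
        key = (subj, rel, obj)
        if key not in best or conf > best[key]:
            best[key] = conf
    # Highest-confidence relations first.
    return sorted(
        ((s, r, o, c) for (s, r, o), c in best.items()),
        key=lambda item: item[3],
        reverse=True,
    )


if __name__ == "__main__":
    ents = [("Bill Gates", "PERSON"), ("Microsoft", "ORGANIZATION")]
    print(candidate_pairs(ents))
    raw = [
        ("Bill Gates", "per:employee_of", "Microsoft", 0.80),
        ("Bill Gates", "per:employee_of", "Microsoft", 0.95),
        ("Microsoft", "org:founded_by", "Bill Gates", 0.40),
    ]
    print(dedupe_relations(raw))
```

Generating ordered pairs rather than unordered ones is what makes the extraction bidirectional: each entity gets a turn as the subject and as the object.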

Pretrained Models

The assets directory contains the following pretrained models:

- pretrained_spanbert/, fine-tuned for TACRED use cases
- coreferee_model_en, from Stanford research
- en_core_web_md-3.50, from spaCy

Installation

To install and set up the project, run the following commands:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/rajatasusual/information_extractor.git
cd information_extractor
pip3 install -r requirements.txt
git lfs pull --include "assets/pretrained_spanbert/pytorch_model.bin"

Ensure that you have Git LFS installed to handle large model files.
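
Because the clone skips LFS downloads (`GIT_LFS_SKIP_SMUDGE=1`), the model file is a small text pointer until `git lfs pull` has run. The following sketch checks for that case by looking for the standard Git LFS pointer header at the start of the file. The helper name `is_lfs_pointer` is hypothetical and not part of this project.

```python
from pathlib import Path

# Git LFS pointer files begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path) -> bool:
    """Return True if the file is a Git LFS pointer stub, not real content."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


if __name__ == "__main__":
    weights = Path("assets/pretrained_spanbert/pytorch_model.bin")
    if not weights.exists():
        print(f"{weights} is missing; clone the repository first.")
    elif is_lfs_pointer(weights):
        print(f"{weights} is still an LFS pointer; run the git lfs pull step.")
    else:
        print(f"{weights} looks like a real model file.")
```

Running this check before loading the model gives a clearer message than the deserialization error that a pointer file would otherwise cause.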

Usage

To extract relations using spaCy and SpanBERT, you can run the provided example script:

python main.py

Example (Inside `main.py`)

import spacy
from spanbert_module import SpanBERT  # Import SpanBERT model

# Load spaCy NLP model
nlp = spacy.load("en_core_web_md")

# Sample text
text = "Bill Gates founded Microsoft. Microsoft is headquartered in Redmond."

# Process text with spaCy
doc = nlp(text)

# Load SpanBERT
pretrained_dir = "assets/pretrained_spanbert"
spanbert = SpanBERT(pretrained_dir=pretrained_dir)

# Extract relations
relations = spanbert.extract_relations(doc)
print(relations)

Acknowledgments

This project integrates SpanBERT from Facebook Research. If you use this project, please cite:

@article{joshi2019spanbert,
    title={{SpanBERT}: Improving Pre-training by Representing and Predicting Spans},
    author={Mandar Joshi and Danqi Chen and Yinhan Liu and Daniel S. Weld and Luke Zettlemoyer and Omer Levy},
    journal={arXiv preprint arXiv:1907.10529},
    year={2019}
}

License & Disclaimer

This project is intended for research and educational purposes. The SpanBERT model belongs to Facebook Research, and its use must comply with their licensing terms. We are not affiliated with Facebook Research.
