# Building Knowledge Graphs: REBEL, LlamaIndex, and REBEL + LlamaIndex



In this notebook, we'll explore how knowledge graphs are constructed using three approaches: REBEL, LlamaIndex, and a combination of REBEL + LlamaIndex. Our primary focus is on comparing the triplets each method extracts and how many it produces. We also query the resulting knowledge graphs with LlamaIndex's Knowledge Graph Query Engine, but the main emphasis remains on the building process.

In [None]:
!pip install python-dotenv
!pip install datasets
!pip install langchain
!pip install transformers
!pip install neo4j
!pip install llama-index
!pip install ipython-ngql nebula3-python networkx pyvis
!pip install torch
!pip install -U huggingface_hub

## Data Preparation

Here we're setting up essential tools and libraries. We'll use these for handling datasets, tokenizing input, managing text chunks, and working with sequence-to-sequence language models.

In [None]:
import os
import random
import json
import hashlib
from datasets import load_dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm.auto import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [None]:
validation_data, test_data = load_dataset("suolyer/pile_wikipedia", split=['validation', 'test'])

In [None]:
data = []
random_rows = random.sample(range(len(test_data)), 10)
build_data = [test_data[val]['text'] for val in random_rows]

In [None]:
build_data[0]

'Sébastien Pan\n\nSébastien Pan (born 9 July 1984) is a French composer and musician, best known for his work on motion picture and animated TV series.\n\nBiography \nBorn in Montbéliard, France, Sébastien Pan gained experience writing music for motion pictures, animated TV series and TV commercials at Imaginex Studios, an international award winning audio post-production house.\n\nBesides writing for live action movies and TV series, Sebastien began his collaboration with the director Wang YunFei in scoring the animation movie "Yugo & Lala" (aka Ava & Lala) in 2012, followed by "Yugo & Lala 2" in 2014 and "Kwai Boo, Crazy space adventure" in 2015. "Kwai Boo" marks the first time a Chinese animation project has received investment from a Hollywood giant, in this case 20th Century Fox.\n\nFilmography\n\nFilm\n\nTelevision\n\nTV Commercials\nPan also scored more than 30 international TV commercials and worked with renowned advertising agencies such as Saatchi and Saatchi, Leo Burnett Wor

In [None]:
max1 = -1
ans = ""
for val in build_data:
  if len(val.split()) > max1:
    max1 = len(val.split())
    ans = val
max1

1514

In [None]:
m = hashlib.md5()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [None]:
def bert_len(text):
    tokens = tokenizer.encode(text)
    return len(tokens)

def create_chunk_dataset(content):
    # Hash each document on its own so its id does not depend on processing order
    uid = hashlib.md5(content.encode('utf-8')).hexdigest()[:12]
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,  # measured in BERT tokens via bert_len
        chunk_overlap=40,
        length_function=bert_len,
        separators=['\n\n', '\n', ' ', ''],
    )
    chunks = text_splitter.split_text(content)
    for i, chunk in enumerate(chunks):
        data.append({
            'id': f'{uid}-{i}',
            'text': chunk
        })

for dt in build_data:
    create_chunk_dataset(dt)
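
A note on content-derived ids: hashing each document separately keeps its id reproducible, whereas a single hash object that is updated across calls produces ids that also depend on every document processed before it. The sketch below illustrates the difference using only the standard library; the helper names doc_uid and running_uids are hypothetical and exist only for this illustration.

```python
import hashlib


def doc_uid(content: str, length: int = 12) -> str:
    """Short id derived only from this document's own text."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()[:length]


def running_uids(docs):
    """Ids taken from one shared hash object that keeps accumulating input."""
    m = hashlib.md5()
    uids = []
    for doc in docs:
        m.update(doc.encode("utf-8"))
        uids.append(m.hexdigest()[:12])
    return uids


docs = ["first document", "second document"]

# Same documents, processed in two different orders.
forward = running_uids(docs)
backward = running_uids(list(reversed(docs)))

# With the shared hash object, "second document" gets a different id
# depending on whether it was processed first or second.
print("shared hash, same id in both orders:", forward[1] == backward[0])

# Hashing each document on its own gives the same id every time.
print("per-document id:", doc_uid("second document"))
```

Stable ids make it easier to re-run the chunking step and match chunks against a previously saved wiki_chunks.jsonl.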

In [None]:
filename = '../data/knowledge graphs/rebel_llamaindex/wiki_chunks.jsonl'
# save
with open(filename, 'w') as outfile:
    for x in data:
        outfile.write(json.dumps(x) + '\n')

# load
# data = []
# with open(filename, 'r') as f:
#     for line in tqdm(f):
#         val = json.loads(line)
#         data.append(val)

## REBEL: Relation Extraction By End-to-end Language generation

Here we extract relation triplets from text using the REBEL model. The utility function extract_triplets parses the model's linearized output into a head, a relation type, and a tail for each triplet. The tokenizer and model are then initialized from Babelscape/rebel-large.

In [None]:
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets
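
To make the parsing step concrete, here is a minimal, self-contained sketch of the linearized format that REBEL emits: each `<triplet>` marker starts a new head entity, `<subj>` precedes a tail entity, and `<obj>` precedes the relation. The sample string is hand-written for illustration (it is not actual model output), and parse_linearized is a hypothetical, compact alternative to the token-by-token parser above, not the function used in the rest of the notebook.

```python
import re


def parse_linearized(text):
    """Parse REBEL-style linearized text into (head, relation, tail) tuples."""
    # Drop sequence-level special tokens; <subj> is untouched by this pattern.
    text = re.sub(r"</?s>|<pad>", " ", text)
    triples = []
    # Everything before the first <triplet> marker is ignored.
    for block in text.split("<triplet>")[1:]:
        head, *rest = block.split("<subj>")
        for part in rest:
            if "<obj>" not in part:
                continue  # incomplete triplet, e.g. generation was cut off
            tail, relation = part.split("<obj>", 1)
            triples.append((head.strip(), relation.strip(), tail.strip()))
    return triples


# Hand-written sample in the linearized format (an assumption for illustration).
sample = (
    "<s><triplet> Savoy Hotel <subj> River Thames <obj> "
    "located in or next to body of water "
    "<triplet> Savoy Hospital <subj> 1512 <obj> inception</s>"
)

for triple in parse_linearized(sample):
    print(triple)
```

Note the ordering inside the linearized text: the tail comes before the relation, which is why the parser swaps them when building each tuple.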

In [None]:
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")

In [None]:
gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 1,
}

triples = []

In [None]:
def generate_triples(texts):

  model_inputs = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
  generated_tokens = model.generate(
      model_inputs["input_ids"].to(model.device),
      attention_mask=model_inputs["attention_mask"].to(model.device),
      **gen_kwargs
  )
  decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)
  for idx, sentence in enumerate(decoded_preds):
      et = extract_triplets(sentence)
      for t in et:
        triples.append((t['head'], t['type'], t['tail']))

for i in tqdm(range(0, len(data), 2)):
  # Process chunks in batches of two; slicing handles a trailing odd chunk
  texts = [d['text'] for d in data[i:i + 2]]
  generate_triples(texts)

In [None]:
distinct_triples = list(set(triples))

In [None]:
# save
with open('../data/knowledge graphs/rebel_llamaindex/rebel_triples.json', 'w') as file:
    json.dump(distinct_triples, file)

# load
with open('../data/knowledge graphs/rebel_llamaindex/rebel_triples.json', 'r') as file:
    loaded_triples = json.load(file)

In [None]:
loaded_triples[:5]

[['Edward III', 'child', 'John of Gaunt'],
 ['Playing God', 'cast member', 'David Duchovny'],
 ["Union–Republican People's Commissariat of the Armed Forces of the Soviet Union",
  'replaces',
  "People's Commissariat of the Navy of the Soviet"],
 ['Somerset County, Pennsylvania',
  'located in the administrative territorial entity',
  'Pennsylvania'],
 ['1860 United States presidential election', 'candidate', 'Abraham Lincoln']]

In [None]:
len(loaded_triples)

43

## NebulaGraph

NebulaGraph is an open-source, distributed, and scalable graph database designed to store and query large volumes of highly connected data with high performance. It is widely used for social networks, recommendation systems, knowledge graphs, security, capital-flow analysis, fraud detection, and AI applications.

In [None]:
from dotenv import load_dotenv
load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')

import logging
import sys

logging.basicConfig(
    stream=sys.stdout, level=logging.INFO
)  # logging.DEBUG for more verbose output

from llama_index import (
    KnowledgeGraphIndex,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
)
from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import NebulaGraphStore
from llama_index.llms import OpenAI

from IPython.display import Markdown, display


# define LLM
# NOTE: at the time of demo, text-davinci-002 did not have rate-limit errors
llm = OpenAI(temperature=0, model="text-davinci-002")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size_limit=512)



In [None]:
os.environ["NEBULA_USER"] = "root"
os.environ["NEBULA_PASSWORD"] = "nebula"  # default is "nebula"
os.environ[
    "NEBULA_ADDRESS"
] = "127.0.0.1:9669"

space_name = "llamaindex"
edge_types, rel_prop_names = ["relationship"], [
    "relationship"
]  # defaults; can be omitted when creating from an empty KG
tags = ["entity"]

In [None]:
graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)
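
The graph store above connects to a space that is expected to exist already, with a tag and an edge type matching tags, edge_types, and rel_prop_names. As a hedged sketch, the helper below only builds the nGQL statements that would create such a schema; it does not talk to the database. The statements follow the schema commonly shown in LlamaIndex's NebulaGraph guide, which is an assumption here, and the helper name nebula_schema_statements, the VID length of 256, and the partition and replica settings are illustrative choices rather than values taken from this notebook.

```python
def nebula_schema_statements(space, tag="entity", edge="relationship", prop="relationship"):
    """Build nGQL statements for a space shaped like the one this notebook assumes."""
    return [
        # String vertex ids, single partition and replica: fine for a local demo.
        f"CREATE SPACE {space}(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);",
        f"USE {space};",
        # Vertices carry the entity name.
        f"CREATE TAG {tag}(name string);",
        # Edges carry the relationship text as a property.
        f"CREATE EDGE {edge}({prop} string);",
        # Index on the name so entities can be looked up by name.
        f"CREATE TAG INDEX {tag}_index ON {tag}(name(256));",
    ]


for statement in nebula_schema_statements("llamaindex"):
    print(statement)
```

Schema changes in NebulaGraph take a few seconds to propagate, so when running statements like these (for example through the ngql magic used later), it may be necessary to pause between creating the space and creating the tag, edge, and index.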

## LlamaIndex 🦙

LlamaIndex is an open-source project designed to facilitate in-context learning. It offers data loaders that convert diverse knowledge sources such as PDFs, Wikipedia pages, and Twitter into a standardized format, which removes most manual preprocessing. Generating and storing embeddings, whether in memory or in a vector database, takes a single line of code. In addition to VectorStoreIndex, it provides KnowledgeGraphIndex, which automates the construction of knowledge graphs from raw text and enables entity-based querying. This is most useful for questions that need information spread across several nodes.

Next, the data is loaded into the system using LlamaIndex's SimpleDirectoryReader, which reads documents from a specified directory.

In [None]:
from llama_index import SimpleDirectoryReader

In [None]:
reader = SimpleDirectoryReader(input_dir="../data/knowledge graphs/rebel_llamaindex/wiki/")
documents = reader.load_data()
print(f"Loaded {len(documents)} docs")

Loaded 23 docs


A Knowledge Graph index, kg_index, is then constructed from these documents. At most 5 triplets are extracted per chunk (max_triplets_per_chunk=5). Setting include_embeddings=True also stores semantic embeddings of the extracted triplets in the index, which supports semantically driven queries later on.

In [None]:
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=5,
    service_context=service_context,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

After connecting to the local NebulaGraph instance, we query the llamaindex space with nGQL (Nebula Graph Query Language): first fetching ten relationship triplets from the constructed knowledge graph, then counting all relationships stored in the database.

In [None]:
%load_ext ngql
%ngql --address 127.0.0.1 --port 9669 --user root --password nebula

Connection Pool Created
INFO:nebula3.logger:Get connection to ('127.0.0.1', 9669)


| | Name |
|---|---|
| 0 | llamaindex |


In [None]:
%ngql USE llamaindex;

INFO:nebula3.logger:Get connection to ('127.0.0.1', 9669)


In [None]:
%ngql MATCH (m)-[e]->(n) RETURN m.entity.name AS m_entity,e.relationship AS relationship,n.entity.name AS n_entity LIMIT 10

INFO:nebula3.logger:Get connection to ('127.0.0.1', 9669)


| | m_entity | relationship | n_entity |
|---|---|---|---|
| 0 | Yale Law School faculty | is category of | people |
| 1 | Tyler and Huo | 2002 | is based on surveys of people in different eth... |
| 2 | Tom R. Tyler | is | author or co-author of 9 books |
| 3 | Tom R. Tyler | is | known for contributions to understanding why p... |
| 4 | Tom R. Tyler | is | professor of psychology and law |
| 5 | Time to Get Alone | is song written by | Brian Wilson |
| 6 | Time to Get Alone | is produced by | Carl Wilson |
| 7 | Time to Get Alone | is released on | 20/20 |
| 8 | Sébastien Pan | best known for | work on motion picture and animated TV series |
| 9 | Sébastien Pan | gained experience writing music for | motion pictures |


In [None]:
%ngql MATCH (m)-[e]->(n) RETURN COUNT(*)

INFO:nebula3.logger:Get connection to ('127.0.0.1', 9669)


| | COUNT(*) |
|---|---|
| 0 | 92 |


In [None]:
from llama_index.query_engine import KnowledgeGraphQueryEngine

from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import NebulaGraphStore

query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)

Let's now run a simple query.

In [None]:
response = query_engine.query(
    "Tell me about Sébastien Pan?",
)
display(Markdown(f"<b>{response}</b>"))

Graph Store Query:
```
MATCH (p:`entity`)-[:relationship]->(m:`entity`) WHERE p.`entity`.`name` == 'Sébastien Pan'
RETURN m.`entity`.`name`;
```
Graph Store Response:
{'m.entity.name': ['work on motion picture and animated TV series', 'motion pictures', 'composer', 'musician']}
Final Response: 

Sébastien Pan is a composer and musician who works on motion pictures and animated TV series.

<b>

Sébastien Pan is a composer and musician who works on motion pictures and animated TV series.</b>

In [None]:
graph_query = query_engine.generate_query(
    "Tell me about Sébastien Pan?",
)

graph_query = graph_query.replace("WHERE", "\n  WHERE").replace("RETURN", "\nRETURN")

display(
    Markdown(
        f"""
```cypher
{graph_query}
```
"""
    )
)


```cypher
MATCH (p:`entity`)-[:relationship]->(m:`entity`) 
  WHERE p.`entity`.`name` == 'Sébastien Pan'
RETURN m.`entity`.`name`;
```


In [None]:
%%ngql
MATCH (p:`entity`)-[e:relationship]->(m:`entity`)
  WHERE p.`entity`.`name` == 'Sébastien Pan'
RETURN p.`entity`.`name`, e.relationship, m.`entity`.`name`;

INFO:nebula3.logger:Get connection to ('127.0.0.1', 9669)


| | p.entity.name | e.relationship | m.entity.name |
|---|---|---|---|
| 0 | Sébastien Pan | best known for | work on motion picture and animated TV series |
| 1 | Sébastien Pan | gained experience writing music for | motion pictures |
| 2 | Sébastien Pan | is | composer |
| 3 | Sébastien Pan | is | musician |


## REBEL + LlamaIndex 🦙

Now, let's establish a new space rebel_llamaindex, which leverages the capabilities of both REBEL and LlamaIndex to build a knowledge graph.

In [None]:
os.environ["NEBULA_USER"] = "root"
os.environ["NEBULA_PASSWORD"] = "nebula"  # default is "nebula"
os.environ[
    "NEBULA_ADDRESS"
] = "127.0.0.1:9669"

space_name = "rebel_llamaindex"
edge_types, rel_prop_names = ["relationship"], [
    "relationship"
]  # defaults; can be omitted when creating from an empty KG
tags = ["entity"]

In [None]:
graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

In [None]:
from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')

In [None]:
def extract_triplets(input_text):
    # Generate with REBEL and decode while keeping the special marker tokens
    text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(input_text, return_tensors=True, return_text=False)[0]["generated_token_ids"]])[0]

    # KnowledgeGraphIndex unpacks each returned item as (subject, relation, object),
    # so every triplet is appended as a tuple; a dict would be unpacked into its keys.
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append((subject.strip(), relation.strip(), object_.strip()))
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append((subject.strip(), relation.strip(), object_.strip()))
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append((subject.strip(), relation.strip(), object_.strip()))

    return triplets

In [None]:
rebel_kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    kg_triplet_extract_fn=extract_triplets,
    storage_context=storage_context,
    max_triplets_per_chunk=5,
    service_context=service_context,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

In [None]:
%load_ext ngql
%ngql --address 127.0.0.1 --port 9669 --user root --password nebula

The ngql extension is already loaded. To reload it, use:
  %reload_ext ngql
Connection Pool Created
INFO:nebula3.logger:Get connection to ('127.0.0.1', 9669)


| | Name |
|---|---|
| 0 | rebel_llamaindex |


In [None]:
%ngql USE rebel_llamaindex;

INFO:nebula3.logger:Get connection to ('127.0.0.1', 9669)


In [None]:
%ngql MATCH (m)-[e]->(n) RETURN m.entity.name AS m_entity,e.relationship AS relationship,n.entity.name AS n_entity LIMIT 10

INFO:nebula3.logger:Get connection to ('127.0.0.1', 9669)


| | m_entity | relationship | n_entity |
|---|---|---|---|
| 0 | WiiWare | has part | WiiWare games (North America) |
| 1 | Union–Republican | replaces | People's Commissariat of Defense of the Soviet... |
| 2 | head | type | tail |
| 3 | Tom Tyler | date of birth | March 3, 1950 |
| 4 | Sir Nathan Wright | position held | Lord Keeper of the Great Seal |
| 5 | Savoy Hotel | located in or next to body of water | River Thames |
| 6 | Savoy Hospital | inception | 1512 |
| 7 | Union–Republican People's Commissariat of the ... | replaces | People's Commissariat of the Navy of the Soviet |
| 8 | Somerset County, Pennsylvania | located in the administrative territorial entity | Pennsylvania |
| 9 | Ministry of Defense of the Russian Federation | replaces | Ministry of Defense of the Soviet Union |
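
Row 2 in the results above, whose values are literally head, type, and tail, is consistent with dict-shaped triplets reaching the index. Assuming KnowledgeGraphIndex unpacks each item returned by a custom kg_triplet_extract_fn as a three-item sequence (an assumption about the library's behavior in the version used here), a dict would be unpacked into its keys rather than its values. The sketch below reproduces that effect in plain Python; as_triples is a hypothetical helper, not part of LlamaIndex or REBEL.

```python
def as_triples(items):
    """Coerce REBEL-style dicts or 3-item sequences into (subject, relation, object) tuples."""
    triples = []
    for item in items:
        if isinstance(item, dict):
            triples.append((item["head"], item["type"], item["tail"]))
        else:
            subj, rel, obj = item
            triples.append((subj, rel, obj))
    return triples


record = {
    "head": "Savoy Hotel",
    "type": "located in or next to body of water",
    "tail": "River Thames",
}

# Unpacking a dict iterates over its keys, not its values.
subj, rel, obj = record
print((subj, rel, obj))  # ('head', 'type', 'tail')

# Normalizing first keeps the actual entity and relation text.
print(as_triples([record]))
```

Because every dict-shaped triplet collapses to the same head/type/tail edge, extractors that mix dicts and tuples would also be expected to yield fewer stored relationships than they actually extracted.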


In [None]:
%ngql MATCH (m)-[e]->(n) RETURN COUNT(*)

INFO:nebula3.logger:Get connection to ('127.0.0.1', 9669)


| | COUNT(*) |
|---|---|
| 0 | 22 |


In [None]:
from llama_index.query_engine import KnowledgeGraphQueryEngine

from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import NebulaGraphStore

query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)

In [None]:
response = query_engine.query(
    "Tell me about Savoy Hotel?",
)
display(Markdown(f"<b>{response}</b>"))

Graph Store Query:

```
MATCH (e:`entity`)-[:relationship]->(h:`entity`) WHERE e.`entity`.`name` == 'Savoy Hotel'
RETURN h.`entity`.`name`;
```
Graph Store Response:
{'h.entity.name': ['River Thames']}
Final Response: 

The Savoy Hotel is located on the River Thames.

<b>

The Savoy Hotel is located on the River Thames.</b>