<a href="https://colab.research.google.com/github/raz0208/ModernBERT/blob/main/ModernBERT_TokenEmbedding_V1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Extract embeddings from input text using ModernBERT (Version 1)

In [1]:
# Install the Neo4j Python driver
!pip install neo4j

Collecting neo4j
  Downloading neo4j-5.28.1-py3-none-any.whl.metadata (5.9 kB)
Downloading neo4j-5.28.1-py3-none-any.whl (312 kB)
Installing collected packages: neo4j
Successfully installed neo4j-5.28.1


In [2]:
# import required libraries
import os
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel
from neo4j import GraphDatabase

### Load the ModernBERT tokenizer and model

In [None]:
# Load ModernBERT tokenizer and model from Hugging Face
MODEL_NAME = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

### Extract embeddings based on the full text

In [4]:
# Function that takes input text and returns the full-text embedding (TODO: edit to get embeddings sentence by sentence)
def get_text_embedding(text):
    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    # Forward pass to get hidden states
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the embeddings (use CLS token for sentence-level embedding)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: [batch_size, hidden_size]

    return cls_embedding.squeeze().numpy()
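
The comment on `get_text_embedding` mentions embedding sentence by sentence. A minimal sketch of that idea follows, assuming a naive regex splitter is good enough; `split_sentences` and `embed_sentences` are hypothetical helper names, not part of the notebook, and the splitter will mis-split abbreviations such as "D.F. 1982".

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace (naive heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def embed_sentences(text, embed_fn):
    """Return (sentence, embedding) pairs, calling embed_fn once per sentence."""
    return [(sentence, embed_fn(sentence)) for sentence in split_sentences(text)]
```

In the notebook this could be called as `embed_sentences(user_text, get_text_embedding)`, which yields one 768-dimensional vector per sentence instead of one for the whole text.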

## Connect to the Neo4j graph database

In [5]:
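# Define Neo4j connection credentials
# The password is read from an environment variable instead of being hard-coded in the notebook
NEO4J_URI = "neo4j://143.225.233.156:7687"
NEO4J_USER = "rezaazari"
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]  # set this variable before running the cell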

# Initialize the driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# Function to test connection
def test_connection():
    with driver.session() as session:
        greeting = session.run("RETURN 'Connected to Neo4j' AS message").single()["message"]
        print(greeting)

if __name__ == "__main__":
    test_connection()

Connected to Neo4j


In [6]:
# Function to run Cypher query
def run_query(cypher_query, parameters=None):
    with driver.session() as session:
        result = session.run(cypher_query, parameters or {})
        return [record.data() for record in result]

# Example query: show five sample nodes
query = "MATCH (n) RETURN n LIMIT 5"
results = run_query(query)
for r in results:
    print(r)

{'n': {'date': '1-12-1987', 'journal': 'The Journal of Cell Biology', 'hub': 0.0, 'auth': 2.6175247320960168e-12, 'subjects': 'Articles', 'pmc': 'PMC2114721', 'abstract': 'Meiosis I in males of the Dipteran Sciara coprophila results in the nonrandom distribution of maternally and paternally derived chromosome sets to the two division products. Based on an earlier study (Kubai, D.F. 1982. J. Cell Biol. 93:655-669), I suggested that the meiosis I spindle does not play a direct role in the nonrandom sorting of chromosomes but that, instead, haploid sets are already separated in prophase nuclei well before the onset of spindle formation. Here I report more direct evidence that this hypothesis is true; this evidence was gained from ultrastructural reconstruction analyses of the arrangement of chromosomes in germ line nuclei (prophase nuclei in spermatogonia and spermatocytes) of males heterozygous for an X- autosome chromosome translocation. Because of this translocation, the maternal and p

In [7]:
# # Earlier version: find similar nodes with a brute-force cosine similarity over all embeddings (kept for reference; superseded by the vector-index query in the next cell)
# def find_similar_nodes(text_embedding, top_n):
#      embedding_list = text_embedding.tolist()
#      cypher_query = """
#      MATCH (n)-[:HAS_EMBEDDING]->(e:ABSTRACT)
#      WHERE e.embedding IS NOT NULL
#      WITH n, labels(n) AS labels, gds.similarity.cosine($sent_embedding, e.embedding) AS similarity
#      RETURN n, labels, similarity
#      ORDER BY similarity DESC
#      LIMIT $limit
#      """
#      parameters = {"sent_embedding": embedding_list, "limit": top_n}
#      results = run_query(cypher_query, parameters)
#      return results
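
For reference, the quantity that `gds.similarity.cosine` computes for each pair of vectors can be sketched in plain Python. This is only an illustration of the formula (dot product divided by the product of the norms), not the GDS implementation; the function name is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The raw cosine lies in [-1, 1]. Scores returned by a Neo4j vector index may be rescaled (the Neo4j documentation describes a normalization to [0, 1] for cosine), so values from the two approaches are not necessarily directly comparable.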

In [8]:
# Function to find similar nodes using the 'abstractEmbeddings' vector index
# (a single indexed query that returns the top_n nearest nodes with their similarity scores)
def find_similar_nodes(text_embedding, top_n):
    embedding_list = text_embedding.tolist()
    cypher_query = """
    CALL db.index.vector.queryNodes('abstractEmbeddings', $limit, $sent_embedding)
    YIELD node, score
    RETURN node AS n, labels(node) AS labels, score AS similarity
    ORDER BY similarity DESC
    """
    parameters = {"sent_embedding": embedding_list, "limit": top_n}
    results = run_query(cypher_query, parameters)
    return results
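
The query above assumes that a vector index named `abstractEmbeddings` already exists in the database. The notebook does not show how it was created, so the following is only a sketch of what such a statement might look like. The label `ABSTRACT` and property `embedding` are taken from the earlier commented-out query, the dimension 768 from the embedding shape printed later, and the `CREATE VECTOR INDEX` syntax assumes a recent Neo4j 5.x server; `build_vector_index_query` is a hypothetical helper.

```python
def build_vector_index_query(index_name="abstractEmbeddings", label="ABSTRACT",
                             prop="embedding", dimensions=768, similarity="cosine"):
    """Build a Cypher statement that creates a vector index if it does not exist yet."""
    return (
        f"CREATE VECTOR INDEX `{index_name}` IF NOT EXISTS\n"
        f"FOR (e:`{label}`) ON (e.`{prop}`)\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {int(dimensions)},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )
```

The resulting string could be passed to `run_query(build_vector_index_query())` once, before any similarity search is issued.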

### Execute the app and get the output

In [9]:
### --- ### Sample text for test ### --- ###

# 1- This is an application about Breast Cancer.
# 2- Treating high blood pressure, high blood lipids, diabetes.
# 3- Heart failure, heart attack, stroke, aneurysm, peripheral artery disease, sudden cardiac arrest. Deaths: 17.9 million / 32% (2015)
# 4- Heart failure and stroke are common causes of death.

In [10]:
# Example usage (Sentence: This is an application about Breast Cancer.)
if __name__ == "__main__":
    user_text = input("Enter your text: ")

    # Get sentence embedding
    full_text_embedding = get_text_embedding(user_text)
    print("\nSentence Embedding vector shape:", full_text_embedding.shape)
    print("Sentence Embedding (first 10 values):", full_text_embedding[:10])

    # Call function to run similarity query
    similar_nodes = find_similar_nodes(full_text_embedding, top_n=5)

    # Show the result
    print(f"\nTop {len(similar_nodes)} similar nodes:")
    for node_data in similar_nodes:
        print(f"Node: {node_data['n']}, Similarity: {node_data['similarity']:.4f}")

Enter your text: Heart failure and stroke are common causes of death.

Sentence Embedding vector shape: (768,)
Sentence Embedding (first 10 values): [ 0.2938816  -1.0113076  -0.8573238  -0.06944127 -0.7596021  -0.7222282
 -1.1270422  -1.2861091   0.33987728 -0.522541  ]

Top 5 similar nodes:
Node: {'embedding': [0.6912307739257812, -0.5753567218780518, -0.0917813777923584, 0.21276627480983734, -0.39925625920295715, -0.2355521023273468, -0.4058747887611389, -0.4011824429035187, -0.0941164568066597, 0.2615385353565216, -0.24153250455856323, 0.11680153012275696, -0.3282451629638672, -0.3066999316215515, 0.6657626032829285, 0.12789304554462433, 0.05052574723958969, 0.8225725889205933, 0.15526734292507172, 0.30804526805877686, 0.35446250438690186, 0.1173621416091919, 0.3075098991394043, -0.4084547162055969, -0.2063133716583252, -0.06287381052970886, -0.363939493894577, 0.06281065195798874, -0.014805168844759464, 0.6888259053230286, -0.5144413113594055, -1.967010736465454, 0.1715601235628128
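
Because each returned node carries its full embedding, the printed result is dominated by hundreds of floats. One possible way to keep the output readable is to drop the embedding and shorten long strings before printing; `summarize_node` is a hypothetical helper and the 80-character cutoff is an arbitrary choice.

```python
def summarize_node(node_props, hidden_keys=("embedding",), max_len=80):
    """Return a copy of the node properties without hidden keys and with long strings shortened."""
    summary = {}
    for key, value in node_props.items():
        if key in hidden_keys:
            continue  # skip bulky fields such as the embedding vector
        if isinstance(value, str) and len(value) > max_len:
            value = value[:max_len] + "..."
        summary[key] = value
    return summary
```

In the loop above, `node_data['n']` could then be replaced with `summarize_node(node_data['n'])`.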

### Import the extracted data into a second database (optional)

In [13]:
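# Define connection credentials for the second Neo4j database
# The password is read from an environment variable instead of being hard-coded in the notebook
SECOND_NEO4J_URI = "neo4j+s://17c6383a.databases.neo4j.io"
SECOND_NEO4J_USER = "neo4j"
SECOND_NEO4J_PASSWORD = os.environ["SECOND_NEO4J_PASSWORD"]  # set this variable before running the cell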
second_driver = GraphDatabase.driver(SECOND_NEO4J_URI, auth=(SECOND_NEO4J_USER, SECOND_NEO4J_PASSWORD))

# Function to test connection
def test_connection():
    with second_driver.session() as session:
        greeting = session.run("RETURN 'Connected to Neo4j' AS message").single()["message"]
        print(greeting)

if __name__ == "__main__":
    test_connection()

Connected to Neo4j


In [14]:
# Function to import the extracted similar nodes into the second graph database with a Cypher query
def import_nodes_to_second_db(nodes_data):
    with second_driver.session() as session:
        for record in nodes_data:
            node_props = record['n']
            labels = record['labels']  # Returned by the similarity query

            # Labels cannot be passed as query parameters, so escape them with backticks;
            # an empty label list yields a node without labels instead of invalid Cypher
            label_string = "".join(":`" + label.replace("`", "``") + "`" for label in labels)
            cypher_query = f"CREATE (n{label_string}) SET n = $props"

            # Pass all properties as a single map parameter
            session.run(cypher_query, {"props": node_props})

    print(f"✅ Imported {len(nodes_data)} nodes to the second Neo4j database.")
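
Since the import uses `CREATE`, running it twice produces duplicate nodes. A possible alternative is to `MERGE` on a unique property. The sketch below builds such a query, assuming `pmc` identifies a node uniquely (it appears in the sample nodes printed earlier, but its uniqueness is not confirmed by the notebook); `build_merge_query` is a hypothetical helper.

```python
def build_merge_query(labels, key="pmc"):
    """Build a MERGE statement that matches on one key property and then sets the rest."""
    def escape(name):
        # Backtick-escape identifiers, doubling any backtick inside the name
        return "`" + name.replace("`", "``") + "`"

    label_string = "".join(":" + escape(label) for label in labels)
    return f"MERGE (n{label_string} {{{escape(key)}: $key_value}}) SET n += $props"
```

It could be run as `session.run(build_merge_query(labels), {"key_value": node_props["pmc"], "props": node_props})`, so repeated imports update the existing node rather than adding a copy.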

In [15]:
# Import to second Neo4j
import_nodes_to_second_db(similar_nodes)

✅ Imported 5 nodes to the second Neo4j database.


In [16]:
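# Function to run a Cypher query against the second database
# (named separately so it does not shadow run_query, which find_similar_nodes still uses)
def run_query_second_db(cypher_query, parameters=None):
    with second_driver.session() as session:
        result = session.run(cypher_query, parameters or {})
        return [record.data() for record in result]

# Example query: show five sample nodes from the second database
query = "MATCH (n) RETURN n LIMIT 5"
results = run_query_second_db(query)
for r in results:
    print(r)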

{'n': {'date': '01-01-2021', 'journal': 'Epidemiology and Health', 'hub': 0.0, 'auth': 6.17622678094711e-49, 'subjects': 'Covid-19; Perspective', 'pmc': 'PMC8060517', 'abstract': 'As severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) continues to spread rapidly throughout the human population, the concept of “herd immunity” has attracted the attention of both decision-makers and the general public. In the absence of a vaccine, this entails that a large proportion of the population will be infected to develop immunity that would limit the severity and/or extent of subsequent outbreaks. We argue that adopting such an approach should be avoided for several reasons. There are significant uncertainties about whether achieving herd immunity is possible. If possible, achieving herd immunity would impose a large burden on society. There are gaps in protection, making it difficult to shield the vulnerable. It would defeat the purpose of avoiding harm caused by the virus. Lastly, dozen