### LLM + RAG demonstration: Neurosymbolic AI and its applications in science

February 2026

This notebook demonstrates Retrieval-Augmented Generation (RAG) over a corpus of papers on neurosymbolic AI and drug discovery co-authored by [Tirtharaj Dash](https://tirtharajdash.github.io). It is part of the AI course (CS F407) and is intended for demonstration only. It may contain bugs; if you find one, please let TD know.

**We have used the following models:**
- Encoder: `BAAI/bge-small-en-v1.5` (dense retrieval, 384-dim)
- Generator: `Qwen/Qwen2.5-3B-Instruct` (instruction-tuned, ~6 GB RAM)

To use a lighter model (CPU-only): set `GEN_MODEL = "HuggingFaceTB/SmolLM2-1.7B-Instruct"`  
To use a heavier model (better quality): set `GEN_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"`
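
As a rough guide when choosing between these generators, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes nominal parameter counts read off the model names (1.7B, 3B, 7B) and counts weights only; activations, the KV cache, and framework overhead are not included, so real usage will be higher. `NOMINAL_PARAMS` and `weight_memory_gb` are illustrative names, not part of the notebook.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Parameter counts are nominal values taken from the model names (assumption).
NOMINAL_PARAMS = {
    "HuggingFaceTB/SmolLM2-1.7B-Instruct": 1.7e9,
    "Qwen/Qwen2.5-3B-Instruct": 3.0e9,
    "mistralai/Mistral-7B-Instruct-v0.3": 7.0e9,
}
BYTES_PER_PARAM = {"float16": 2, "float32": 4}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in NOMINAL_PARAMS.items():
    fp16 = weight_memory_gb(n_params, "float16")
    fp32 = weight_memory_gb(n_params, "float32")
    print(f"{name:40s} fp16 ~{fp16:.1f} GB | fp32 ~{fp32:.1f} GB")
```

At float16 the 3B model comes to about 6 GB, in line with the figure quoted above. This notebook loads float32 weights when no GPU is available, which doubles the estimate.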

**Libraries and packages to be installed before running this notebook**:
```python
# Caution: this may alter your current environment. Create a fresh environment before running this notebook.
# !pip install -q pymupdf sentence-transformers faiss-cpu transformers accelerate pandas
```

In [1]:
import os
from pathlib import Path

PAPERS_DIR   = Path("Dash_NeuroSym_Paps")   # folder with our papers' PDFs
INDEX_FILE   = Path("neurosym.index")       # FAISS index (saved after first build)
CORPUS_FILE  = Path("neurosym_corpus.pkl")  # chunk text + metadata

CHUNK_WORDS  = 200    # words per chunk
OVERLAP      = 40     # word overlap between consecutive chunks

EMBED_MODEL  = "BAAI/bge-small-en-v1.5"  # 384-dim, fast, good quality
TOP_K        = 4                         # chunks to retrieve per query

GEN_MODEL    = "Qwen/Qwen2.5-3B-Instruct"   # default generator (3B, instruction-tuned)
MAX_NEW_TOK  = 350

print(f"Papers directory : {PAPERS_DIR.resolve()}")
print(f"PDFs found       : {len(list(PAPERS_DIR.glob('*.pdf')))}")
print(f"Embed model      : {EMBED_MODEL}")
print(f"Generator model  : {GEN_MODEL}")

Papers directory : /home/tirtharaj/dash/CS-F407_Artificial-Intelligence/labs/Lab3/Dash_NeuroSym_Paps
PDFs found       : 9
Embed model      : BAAI/bge-small-en-v1.5
Generator model  : Qwen/Qwen2.5-3B-Instruct


**First, ingest the PDFs and build the FAISS index.**

Each PDF is split into overlapping word-level chunks. Each chunk is embedded by
`BAAI/bge-small-en-v1.5` and stored in a FAISS inner-product index (equivalent
to cosine similarity after L2 normalisation).

The index is saved to disk so this step runs only once.
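
To make the overlapping windows concrete, here is a minimal sketch of the word offsets that a sliding window produces. It assumes the notebook's defaults (200 words per chunk, 40 words of overlap) and a made-up 500-word document; `chunk_offsets` is a hypothetical helper used only for illustration.

```python
def chunk_offsets(n_words: int, chunk_words: int = 200, overlap: int = 40):
    """Return (start, end) word offsets of each sliding window."""
    stride = chunk_words - overlap          # how far the window advances each step
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_words")
    return [
        (start, min(start + chunk_words, n_words))
        for start in range(0, n_words, stride)
    ]


# Toy document length, chosen for illustration only.
windows = chunk_offsets(n_words=500)
for start, end in windows:
    print(f"words[{start}:{end}]  ({end - start} words)")
```

With these defaults the stride is 160 words, so consecutive windows share 40 words. The final window here holds only 20 words and lies entirely inside the previous one; the notebook's `extract_chunks` discards a chunk only when it is shorter than 60 characters.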

In [2]:
import pymupdf as fitz                                  # PyMuPDF
import pickle
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm


def extract_chunks(pdf_path: Path, chunk_words: int, overlap: int):
    """Extract overlapping word-level chunks from a PDF."""
    doc  = fitz.open(str(pdf_path))
    text = " ".join(page.get_text() for page in doc)
    doc.close()

    words  = text.split()
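    stride = chunk_words - overlap
    if stride <= 0:                          # guard: range() needs a positive step
        raise ValueError("overlap must be smaller than chunk_words")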
    chunks, meta = [], []
    for i in range(0, len(words), stride):
        chunk = " ".join(words[i : i + chunk_words])
        if len(chunk.strip()) < 60:          # skip near-empty trailing chunks
            continue
        chunks.append(chunk)
        meta.append({"source": pdf_path.name, "word_offset": i})
    return chunks, meta


def build_index(papers_dir: Path, chunk_words: int, overlap: int,
                embed_model_name: str, index_file: Path, corpus_file: Path):
    pdfs = sorted(papers_dir.glob("*.pdf"))
    assert pdfs, f"No PDFs found in {papers_dir}"

    print(f"Ingesting {len(pdfs)} PDFs ...")
    corpus, metadata = [], []
    for pdf in tqdm(pdfs, desc="Parsing PDFs"):
        c, m = extract_chunks(pdf, chunk_words, overlap)
        corpus.extend(c)
        metadata.extend(m)
    print(f"Total chunks: {len(corpus)}")

    print(f"\nEncoding with {embed_model_name} ...")
    encoder    = SentenceTransformer(embed_model_name)
    embeddings = encoder.encode(
        corpus, batch_size=64, show_progress_bar=True,
        convert_to_numpy=True, normalize_embeddings=True  # L2-normalise -> cosine via IP
    )

    dim   = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)   # inner product on unit vectors = cosine similarity
    index.add(embeddings)

    faiss.write_index(index, str(index_file))
    with open(corpus_file, "wb") as f:
        pickle.dump((corpus, metadata), f)

    print(f"\nIndex saved to  : {index_file}")
    print(f"Corpus saved to : {corpus_file}")
    return index, corpus, metadata, encoder


# Build or load
if INDEX_FILE.exists() and CORPUS_FILE.exists():
    print("Loading existing index ...")
    index = faiss.read_index(str(INDEX_FILE))
    with open(CORPUS_FILE, "rb") as f:
        corpus, metadata = pickle.load(f)
    encoder = SentenceTransformer(EMBED_MODEL)
    print(f"Index loaded: {index.ntotal} vectors, dim={index.d}")
else:
    index, corpus, metadata, encoder = build_index(
        PAPERS_DIR, CHUNK_WORDS, OVERLAP, EMBED_MODEL, INDEX_FILE, CORPUS_FILE
    )

Loading existing index ...
Index loaded: 1094 vectors, dim=384


**Retrieval function**

Given a query string $q$, compute $\mathbf{e}_q = f_\phi(q)$ and return the
top-$k$ chunks by cosine similarity.
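
For unit-length vectors the inner product equals cosine similarity, since $\cos(\mathbf{e}_q, \mathbf{e}_c) = \mathbf{e}_q^\top \mathbf{e}_c \,/\, (\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_c \rVert)$ and both norms are 1 after L2 normalisation. The sketch below checks this with NumPy on random toy vectors. It is a brute-force stand-in for the exhaustive search a flat inner-product index performs, not the FAISS code path itself, and `l2_normalise` and `top_k_cosine` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8)).astype("float32")   # toy chunk embeddings
query = rng.normal(size=8).astype("float32")       # toy query embedding


def l2_normalise(x: np.ndarray) -> np.ndarray:
    """Scale each vector (last axis) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2):
    """Indices and scores of the k most similar rows, highest score first."""
    scores = l2_normalise(doc_vecs) @ l2_normalise(query_vec)  # inner product on unit vectors
    order = np.argsort(-scores)[:k]
    return order, scores[order]


idx, scores = top_k_cosine(query, docs, k=2)

# Explicit cosine formula on the unnormalised vectors, for comparison.
cos = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
print("top-k indices :", idx)
print("scores match  :", np.allclose(scores, cos[idx], atol=1e-5))
```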

In [3]:
def retrieve(query: str, k: int = TOP_K):
    """
    Embed query and return top-k chunks with scores and source filenames.
    Returns a list of dicts: {chunk, source, score}
    """
    q_emb = encoder.encode(
        [query], convert_to_numpy=True, normalize_embeddings=True
    )
    scores, indices = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:                      # FAISS pads with -1 when fewer than k vectors exist
            continue
        results.append({
            "chunk":  corpus[idx],
            "source": metadata[idx]["source"],
            "score":  float(score)
        })
    return results


def show_retrieved(hits):
    for i, h in enumerate(hits):
        print(f"\n--- Chunk {i+1} | source: {h['source']} | score: {h['score']:.3f} ---")
        print(h["chunk"][:400], "..." if len(h["chunk"]) > 400 else "")


# Sanity check: retrieve for a sample query
test_hits = retrieve("neurosymbolic drug discovery")
print(f"Retrieved {len(test_hits)} chunks for sanity check:")
for h in test_hits:
    print(f"  [{h['score']:.3f}] {h['source']}")

Retrieved 4 chunks for sanity check:
  [0.809] 2510.23379v1.pdf
  [0.790] 27751-Article Text-31805-1-2-20240324.pdf
  [0.787] 2510.23379v1.pdf
  [0.784] 2510.23379v1.pdf


In [4]:
# Generator LLM
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")

tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    GEN_MODEL,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto"
)

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=MAX_NEW_TOK,
    do_sample=False,      # greedy for reproducibility in class demo
    temperature=None,
    top_p=None
)
print(f"Generator loaded: {GEN_MODEL}")

Device: cuda


Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Generator loaded: Qwen/Qwen2.5-3B-Instruct


**RAG inference**

Two functions:
- `llm_only(query)`: parametric generation, no retrieval
- `rag_answer(query)`: retrieve then generate
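
The only difference between the two paths is what goes into the user message. The sketch below illustrates this with plain chat-message dictionaries and no model or tokenizer. `make_messages`, the shortened system prompts, and the `toy_hits` entries are all made up for illustration.

```python
def make_messages(system: str, query: str, context: str = "") -> list[dict]:
    """Hypothetical helper: chat messages for LLM-only (no context) or RAG (with context)."""
    user = query if not context else (
        f"Context from research papers:\n\n{context}\n\nQuestion: {query}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# Made-up retrieval results, standing in for the output of a retriever.
toy_hits = [
    {"source": "paper_a.pdf", "score": 0.81, "chunk": "Toy chunk text A."},
    {"source": "paper_b.pdf", "score": 0.77, "chunk": "Toy chunk text B."},
]
context = "\n\n".join(
    f"[Source: {h['source']} | relevance: {h['score']:.3f}]\n{h['chunk']}"
    for h in toy_hits
)

plain = make_messages("Answer as accurately as you can.", "What is ILP?")
rag = make_messages("Answer using ONLY the provided context.", "What is ILP?", context)

print(plain[1]["content"])
print("-" * 40)
print(rag[1]["content"])
```

Tagging each chunk with its source filename inside the context is what lets the generator attribute its answer to a specific paper.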

In [5]:
SYSTEM_PLAIN = (
    "You are a knowledgeable AI research assistant. "
    "Answer the question as accurately as you can."
)

SYSTEM_RAG = (
    "You are a precise research assistant. "
    "Answer the question using ONLY the provided context. "
    "If the answer is not in the context, say: 'Not found in the provided documents.' "
    "Always state which paper (source) supports your answer."
)


def build_prompt(system: str, context: str, query: str) -> str:
    """Build a chat-style prompt compatible with Qwen2.5 / Mistral / SmolLM2."""
    messages = [{"role": "system", "content": system}]
    user_content = f"{query}" if not context else (
        f"Context from research papers:\n\n{context}\n\nQuestion: {query}"
    )
    messages.append({"role": "user", "content": user_content})
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


def extract_answer(raw_output: str, prompt: str) -> str:
    """Strip the prompt prefix from generator output."""
    return raw_output[len(prompt):].strip()


def llm_only(query: str) -> str:
    """Pure parametric generation — no retrieval."""
    prompt = build_prompt(SYSTEM_PLAIN, "", query)
    raw    = generator(prompt)[0]["generated_text"]
    return extract_answer(raw, prompt)


def rag_answer(query: str, k: int = TOP_K):
    """Retrieve top-k chunks, then generate a grounded answer."""
    hits    = retrieve(query, k)
    context = "\n\n".join(
        f"[Source: {h['source']} | relevance: {h['score']:.3f}]\n{h['chunk']}"
        for h in hits
    )
    prompt = build_prompt(SYSTEM_RAG, context, query)
    raw    = generator(prompt)[0]["generated_text"]
    return extract_answer(raw, prompt), hits


print("Functions defined: llm_only(), rag_answer()")

Functions defined: llm_only(), rag_answer()


**Test: LLM-only vs RAG-augmented**

For each query we show:
1. The LLM-only answer (parametric; may hallucinate)
2. The retrieved chunks (with source filenames and similarity scores)
3. The RAG-augmented answer (grounded; attributed)

In [6]:
DEMO_QUERIES = [
    "What symbolic representation is used for molecules in Dash et al.'s neurosymbolic framework?",
    "How is background knowledge encoded in the ILP-based drug discovery approach?",
    "What deep learning models has Dash worked on?"
]

SEP = "=" * 80

for query in DEMO_QUERIES:
    print(SEP)
    print(f"QUERY: {query}")
    print(SEP)

    print("\n[LLM-only answer]")
    print("-" * 40)
    ans_plain = llm_only(query)
    print(ans_plain)

    print("\n[Retrieved chunks]")
    print("-" * 40)
    ans_rag, hits = rag_answer(query)
    show_retrieved(hits)

    print("\n[RAG answer]")
    print("-" * 40)
    print(ans_rag)
    print()

QUERY: What symbolic representation is used for molecules in Dash et al.'s neurosymbolic framework?

[LLM-only answer]
----------------------------------------


I apologize, but there seems to be an error in your question as no specific paper titled "Dash et al." with a neurosymbolic framework specifically about molecular representations has been widely recognized in the literature up to my last update in October 2023.

However, if you're referring to a neurosymbolic framework that deals with molecular representations, one common approach is to use chemical formulas or molecular structures as symbols. In computational models, these might be represented using SMILES (Simplified Molecular Input Line Entry System) or InChI (International Chemical Identifier) strings, which are standardized ways of encoding molecular structures.

If you have more context or details about the specific framework you're referring to, I'd be happy to provide a more accurate answer based on that information.

[Retrieved chunks]
----------------------------------------



--- Chunk 1 | source: 2510.23379v1.pdf | score: 0.834 ---
represented as a Grothendieck construction over an indexed family of partially-ordered sets. The construction yields a unified space linking symbolic and neural generation. • We implement and test SNGs on the real-world problem of generating potential in- hibitors for protein-targets. This constitutes the problem of ‘lead-discovery’, and re- alistic settings have the following charactestics. Given ...

--- Chunk 2 | source: 2510.23379v1.pdf | score: 0.832 ---
the implementation – instances generated – is also true of the specification (satisfies the constraints of the symbolic model). Unlike with classic formal methods, in SNG ML techniques play a role in obtaining both the specification and the implementation. We have proposed using the mathematical structure of fibered-posets by way of specifying the codomain of an SNG system. The paired elements com ...

--- Chunk 3 | source: 2510.23379v1.pdf | score: 0.825 ---
distribution.

There are several popular datasets used in the field of drug-likeness and bioactivity prediction. Here are some key ones:

1. PubChem BioAssay Datasets: These include a variety of bioactivity data from various assays, which can be used to predict drug-like properties.

2. ZINC Database: This is one of the most widely used databases for virtual screening and drug discovery. It contains over 1.3 billion commercially available small molecules.

3. Tox21 Datasets: These datasets come from the National Institute of Environmental Health Sciences (NIEHS) and the National Toxicology Program (NTP), providing a wide range of chemical structures and their biological activity.

4. ChEMBL: A large repository of bioactive compounds and their associated experimental data, including drug-like properties.

5. BindingDB: Contains experimentally determined protein-ligand binding affinities, which can be useful for predicting drug-like properties.

6. OpenTox: An open-source platform that provides a colle


--- Chunk 1 | source: s10994-021-06090-8.pdf | score: 0.819 ---
to GNNs (Dash et al., 2021c). 600 Machine Learning (2022) 111:575–623 1 3 5.2 Materials 5.2.1 Data For the empirical evaluation of our proposed BotGNNs, we use 73 benchmark datasets arising in the field of drug-discovery. Each dataset represents extensive drug evaluation effort at the NCI14 to experimentally determine the effectiveness of anti-cancer activity of a compound against a number of ce ...

--- Chunk 2 | source: PhD_Thesis_Final.pdf | score: 0.784 ---
a number of cell- lines [MOHU03]. These datasets correspond to the concentration parameter GI50, which is the concentration that results in 50% growth inhibition of tumour cells. Some of the datasets have been used in various data mining studies, such as in a study involving the use of graph kernels in machine learning [RSSB05]. These datasets are also used in the study of LRNNs [SAZ+18, ˇSou20].  ...

--- Chunk 3 | source: 2021.07.09.451519v2.full.pdf | score: 0

In an Inductive Logic Programming (ILP)-based drug discovery approach, background knowledge is typically encoded through logical rules or constraints that guide the learning process. Here's how it is usually implemented:

1. **Rule Encoding**: Background knowledge is first formalized into logical rules or constraints. These rules often represent biological knowledge, chemical properties, and disease mechanisms relevant to the drug discovery problem. For example, a rule might state that "a compound X interacts with protein Y if it has a certain binding affinity."

2. **Knowledge Base Construction**: The encoded rules are compiled into a knowledge base. This knowledge base serves as the training data for the ILP system, providing context and prior information about what constitutes a valid solution.

3. **Learning Algorithm**: An ILP algorithm then processes this knowledge base to learn patterns and relationships that generalize from the provided examples to new instances. The algorithm 


--- Chunk 1 | source: PhD_Thesis_Final.pdf | score: 0.812 ---
novel molecules for early-stage drug design. We show how a graph-based neural network with domain-knowledge is one component in a modular system design. Specifically, our conclusions from this chapter are as follows: (1) We have constructed a complete end- to-end neural-symbolic system that is capable of generating active molecules that may not be in any existing database; (2) We have demonstrated ...

--- Chunk 2 | source: 2021.07.09.451519v2.full.pdf | score: 0.810 ---
Using Domain-Knowledge to Assist Lead Discovery in Early-Stage Drug Design Tirtharaj Dash1, Ashwin Srinivasan1, Lovekesh Vig2, and Arijit Roy3 1 Dept. of CSIS & APPCAIR, BITS Pilani, Goa Campus, India 2 TCS Research, New Delhi, India 3 TCS Innovation Labs (Life Sciences Division), Hyderabad, India Abstract. We are interested in generating new small molecules which could act as inhibitors of a biol ...

--- Chunk 3 | source: 27751-Article Text-31805-1-2-2024

I apologize, but there is no widely recognized or publicly available study that directly compares a "neurosymbolic model" to Graph Neural Network (GNN) baselines in terms of specific results. The term "neurosymbolic model" is not commonly used in the context of neural networks and graph learning, and it's unclear what kind of model this refers to.

Neurosymbolic models typically refer to approaches that combine neural networks with symbolic reasoning, which is quite different from GNNs. GNNs are specifically designed for tasks involving graph-structured data.

If you're looking for comparisons between GNNs and other types of neural network models, such as feedforward neural networks or recurrent neural networks, those comparisons are well-documented in the literature. However, these comparisons would not be directly comparable to a neurosymbolic model.

To get accurate results on how a neurosymbolic model performs compared to GNNs, you would need to:

1. Identify a specific neurosymbol


--- Chunk 1 | source: PhD_Thesis_Final.pdf | score: 0.824 ---
for all variants of GNN. We examine the results in more detail: From Figure 4.3, it is evident that the per- formance of graph-based networks improves with the inclusion of domain-knowledge. A quantitative tabulation of wins, losses and draws is in Figure 4.4. These results again provide sufficient grounds to answer positively the primary research question addressed in this dissertation, namely: d ...

--- Chunk 2 | source: PhD_Thesis_Final.pdf | score: 0.797 ---
for training the VEGNNs (V EGNN1,...,5). The learning rate is 0.0005, weight decay parameter is 0.0001, the momentum factors are set to the default values of (β1, β2) = (0.9, 0.999). • The maximum number of training epochs is 1000. The batch size is 128. • We use an early-stopping mechanism [Pre98] to avoid overfitting during training. The resulting model is then saved and can be used for evaluati ...

--- Chunk 3 | source: PhD_Thesis_Final.pdf | score: 0.794 ---
v

I'm sorry, but there seems to be some confusion. There is no widely known entity called "Dash" that has been involved in deep learning model development. It's possible you might be thinking of Dash (company), which is a financial technology company, or Dash (software), which is a decentralized cryptocurrency. However, neither of these entities is typically associated with deep learning model development.

If you meant to ask about a specific individual named Dash who works on deep learning models, I would need more context to provide accurate information. If you could clarify the name or provide additional details, I'd be happy to assist further.

[Retrieved chunks]
----------------------------------------

--- Chunk 1 | source: PhD_Thesis_Final.pdf | score: 0.742 ---
machines”, International Con- ference on Artificial Neural Networks, 2019. URL: https://doi.org/10.1007/978-3-030-30484-3_3 6. T. Dash, A. Srinivasan, L. Vig, O.I. Orhobor, R.D. King, “Large-scale assessment of deep relat

In [7]:
query = input("Enter your question: ").strip()
if query:
    print(SEP)
    print("[LLM-only]")
    print(llm_only(query))

    print("\n[Retrieved chunks]")
    ans, hits = rag_answer(query)
    show_retrieved(hits)

    print("\n[RAG answer]")
    print(ans)

Enter your question:  What is neurosymbolic learning?


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset

[LLM-only]


Neurosymbolic learning is an emerging field that combines elements of neural networks (neuro) and symbolic AI (symbolic). This approach aims to leverage the strengths of both paradigms to create more robust, interpretable, and versatile AI systems.

Key aspects of neurosymbolic learning include:

1. Neural Networks: These provide the ability to learn complex patterns and relationships from data, which can be particularly useful for tasks where large amounts of data are available.

2. Symbolic AI: This involves using logical rules, knowledge bases, and other formal methods to represent and reason about the world. It provides a way to ensure that AI systems can make sense of their environment and act in ways that align with human understanding and values.

3. Integration: Neurosymbolic learning seeks to integrate these two approaches, allowing neural networks to handle the non-symbolic, data-driven aspects of learning while symbolic reasoning handles the symbolic, rule-based aspects.

4.

In [9]:
# Sanity check: number of chunks contributed by each paper
import pandas as pd
from collections import Counter

source_counts = Counter(m["source"] for m in metadata)
df = pd.DataFrame(source_counts.items(), columns=["paper", "chunks"])
df = df.sort_values("chunks", ascending=False).reset_index(drop=True)

print(f"Total papers  : {len(df)}")
print(f"Total chunks  : {df['chunks'].sum()}")
print(f"Avg chunks    : {df['chunks'].mean():.1f}")
print()
print(df.to_string(index=False))

Total papers  : 9
Total chunks  : 1094
Avg chunks    : 121.6

                                    paper  chunks
                     PhD_Thesis_Final.pdf     456
                   s10994-021-06090-8.pdf     135
                   s10994-023-06399-6.pdf     117
                         2510.23379v1.pdf     103
                   s41598-021-04590-0.pdf      73
                   s10994-021-05966-z.pdf      70
             2021.07.09.451519v2.full.pdf      50
                  978-3-030-30484-3_3.pdf      47
27751-Article Text-31805-1-2-20240324.pdf      43
