In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


# Constructing a Knowledge Graph from Research Publications

1. Introduction to Knowledge Graphs

2. Why Are Knowledge Graphs Useful in Literature Study?

3. How to build a knowledge graph
  - Set up a Neo4j instance with Neo4j and Cypher
  - Extract Metadata from research paper
  - Knowledge Graph Setup
  - Vector embedding
  - Ask questions
  - Visualize Knowledge Graph
  - Writing Cypher with an LLM

## 1. Introduction to Knowledge Graphs

Graphs are great at representing and storing heterogeneous and interconnected information in a structured manner, effortlessly capturing complex relationships and attributes across diverse data types. In contrast, vector databases often struggle with such structured information, as their strength lies in handling unstructured data through high-dimensional vectors. (https://medium.com/neo4j/enhancing-the-accuracy-of-rag-applications-with-knowledge-graphs-ad5e2ffab663)


<image src= 'https://media.licdn.com/dms/image/v2/C5612AQEt5xJTvQ0HDg/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1639491776767?e=1749686400&v=beta&t=O9kUhJuqtm1HEPdrsWHhIkKVlQ_-TKSgyLyzt2vCyvU'>


Image source: https://www.linkedin.com/pulse/getting-intelligent-answers-from-knowledge-graphs-peter-lawrence/

## 2. Why Are Knowledge Graphs Useful in Literature Study?

Connected Papers:
A connected graph approach that shows similarity between publications, even when there's no direct citation, helping to infer topic similarity. **Highlights the most relevant and essential papers.**

Bibliometric Networks:
Maps and constructs networks based on co-authorship or co-occurrence. **Uses distance-based visualization to indicate similarity between publications. Helps identify high-impact publications in a field.**

Concept Structuring and Clustering:
Tools like SciKGraph automate the structuring of concepts/topics within a domain. **Uses semantic-based analysis and NLP-driven classification. High accuracy in organizing scientific content.**

Moreover, RAG that relies only on vector search has limitations for literature study. It misses context, cannot capture relationships between data points, and often produces unreliable results. When your app needs to understand the connections between authors, topics, and citations, basic RAG falls short, and you can end up with hallucinations or irrelevant, unexplainable results. Combining a knowledge graph with traditional RAG approaches can mitigate these issues.

## 3. How to build a knowledge graph

**Set up a Neo4j instance.**  
   This will serve as the platform for building and exploring our knowledge graph.

**Process the data.**  
   We'll extract key metadata—such as authors and keywords (topics)—from a sample of 50 research papers. Each paper will also be split into smaller chunks for more granular analysis.

**Build the knowledge graph.**  
   We'll add nodes representing papers, chunks, authors, and topics, and define relationships between them.  
   Each chunk will be embedded using a vector embedding model. These embeddings will be stored in a vector store, enabling efficient semantic search and question answering.

**Query with Cypher using an LLM.**  
   We’ll use a large language model (LLM) to help generate Cypher queries, so you don’t need to write them manually. Simply ask your question in natural language, and the LLM will translate it into Cypher.
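
As a rough illustration of the chunking step described above, here is a minimal sketch of overlapping, fixed-size character windows. It is a simplified stand-in, not the notebook's method: the notebook imports LangChain's `RecursiveCharacterTextSplitter`, which also respects separators such as paragraph breaks. The function name and the `chunk_size` and `overlap` values are assumptions chosen for the example.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_into_chunks(sample, chunk_size=12, overlap=4)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk.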



### Set up a Neo4j instance with Neo4j and Cypher

#### What is Neo4j

**Neo4j** is a **graph database**. Unlike traditional relational databases that use tables, Neo4j stores data in **nodes** and **relationships**.

This model is ideal for use cases such as:
- Social networks
- Recommendation engines
- Knowledge graphs
- Fraud detection

| Concept        | Description                             | Example                          |
|----------------|-----------------------------------------|----------------------------------|
| **Node**       | An entity or object                     | A person, a product, a location  |
| **Relationship** | A connection between nodes            | 'FRIENDS_WITH', 'PURCHASED'      |
| **Label**      | A category or type assigned to nodes    | ':Person', ':Movie'             |
| **Property**   | Key-value pair on a node/relationship   | 'name: "Alice"', 'age: 30'       |



#### What is Cypher?

**Cypher** is Neo4j’s query language. It's designed to work with **graph patterns** and is intuitive, much like SQL is for relational databases.
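
To make the idea of a graph pattern concrete, the sketch below builds a parameterized Cypher query as a Python string. Nodes are written in parentheses and relationships in square brackets with an arrow for direction. The `:Person` label and `FRIENDS_WITH` relationship come from the table above; the helper name `friends_of_query` is hypothetical, and nothing is sent to a database here.

```python
from typing import Any, Dict, Tuple


def friends_of_query(name: str) -> Tuple[str, Dict[str, Any]]:
    """Return a parameterized Cypher query and its parameters."""
    cypher = (
        "MATCH (p:Person {name: $name})-[:FRIENDS_WITH]->(friend:Person) "
        "RETURN friend.name AS friend"
    )
    # The value travels separately as a parameter instead of being
    # pasted into the query text.
    return cypher, {"name": name}


query, params = friends_of_query("Alice")
print(query)
print(params)
```

Keeping values in `$name`-style parameters rather than formatting them into the string avoids quoting problems and query injection. The same pattern is used later in this notebook with `kg.query(cypher, params=...)`.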




### Neo4j instance
In this workshop, we will provide a Neo4j instance via api_keys.txt.

If you would like to use your own Neo4j instance, the easiest way is to start one on Neo4j Aura (https://neo4j.com/product/auradb/), which offers cloud instances of the Neo4j database. Alternatively, you can set up a local instance by downloading the Neo4j Desktop application (https://neo4j.com/download/) and creating a local database.


Please remember to keep the private key safe.

**Neo4j Desktop uses Graph Apps and other web content. Some of these are provided by the community and, like any other software you install, could potentially cause data integrity and security issues. If you're working with sensitive data we recommend that you perform an independent security audit before using it.**

Loading api keys from '/content/drive/MyDrive/Colab_Notebooks/AI/api_keys.txt'

In [2]:
import os

api_keys_path = '/content/drive/MyDrive/Colab_Notebooks/AI/api_keys.txt'

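with open(api_keys_path) as f:
    for line in f:
        line = line.strip()
        # Skip blank lines and anything that is not KEY=VALUE
        if not line or '=' not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='
        key, value = line.split('=', 1)
        os.environ[key.strip()] = value.strip()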
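
The loader above implies that `api_keys.txt` holds one `KEY=VALUE` pair per line. As a sketch under that assumption, the parsing can be pulled into a small function that is easy to test without touching `os.environ`. Support for `#` comments and quoted values is an added assumption, not something the workshop file is known to use, and the function name is hypothetical.

```python
from typing import Dict, Iterable


def parse_key_value_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    parsed = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split on the first '=' only
        # Drop surrounding whitespace and optional quotes around the value
        parsed[key.strip()] = value.strip().strip('"').strip("'")
    return parsed


example = [
    "# credentials\n",
    "\n",
    "NEO4J_USERNAME=neo4j\n",
    "NEO4J_PASSWORD = 'abc=123'\n",
]
print(parse_key_value_lines(example))
```

Calling `os.environ.update(parse_key_value_lines(open(api_keys_path)))` would then load the values in one step.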

If you prefer to store the information in Colab Secrets instead of
using the api_keys.txt file, uncomment the following code.

In [3]:
# from google.colab import userdata
# import os

# # Securely access the API key from Colab's Secrets
# try:
#     openai_api_key = userdata.get('OPENAI_API_KEY')
# except Exception as e:
#     print("Error: Please add your OpenAI API key to Colab Secrets")
#     print("Steps: 1. Click the 'key' icon in the left panel")
#     print("       2. Add a secret named OPENAI_API_KEY with your API key")
#     raise e

# os.environ['OPENAI_API_KEY'] = openai_api_key
# os.environ["NEO4J_URI"] = userdata.get('NEO4J_URI')
# os.environ["NEO4J_USERNAME"] = userdata.get('NEO4J_USERNAME')
# os.environ["NEO4J_PASSWORD"] = userdata.get('NEO4J_PASSWORD')


### Install packages

In [4]:
!pip install -q langchain langchain-community langchain-openai langchain-experimental neo4j tiktoken yfiles_jupyter_graphs PyPDF2 faiss-cpu sentence-transformers transformers python-dotenv requests torch pypdf pdfplumber keybert

In [5]:
import os
import textwrap
import getpass
import logging
import time

# Langchain and vector embeddings
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate

# Import KeyBERT for better phrase extraction
from keybert import KeyBERT

import pdfplumber
import hashlib
from collections import Counter  # used by the keyword-frequency fallbacks below
from typing import Dict, List
import re
import shutil
import glob
import json

from neo4j.exceptions import ServiceUnavailable, IncompleteCommit

import warnings
warnings.filterwarnings("ignore")

In [6]:
os.environ["NEO4J_URI"]

'neo4j+s://f21edcc2.databases.neo4j.io'

In [7]:
import os, socket, subprocess, pprint, re

def check_env():
    uri = os.getenv("NEO4J_URI")
    user = os.getenv("NEO4J_USERNAME")
    pwd = os.getenv("NEO4J_PASSWORD")
    print("🔎 Environment values")
    print("  NEO4J_URI     :", repr(uri))
    print("  NEO4J_USERNAME:", repr(user))
    print("  NEO4J_PASSWORD:", "<set>" if pwd else None)

    if uri and uri.strip() != uri:
        print("⚠️  NEO4J_URI has leading/trailing whitespace!")
    if uri and not re.match(r"^neo4j\+s://.+\.databases\.neo4j\.io$", uri.strip()):
        print("⚠️  URI does not look like a valid Aura URI (neo4j+s://<hash>.databases.neo4j.io).")

def dns_lookup(host):
    print(f"\n🔎 DNS lookup for {host}")
    try:
        info = socket.getaddrinfo(host, 7687)
        pprint.pprint(info)
        print("✅ Host resolved.")
        return True
    except socket.gaierror as e:
        print("❌ DNS resolution failed:", e)
        return False

def port_check(host, port=7687, timeout=3):
    print(f"\n🔎 TCP connect to {host}:{port}")
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print("✅ TCP connection succeeded.")
            return True
    except OSError as e:
        print("❌ TCP connection failed:", e)
        return False

def shell_nslookup(host):
    print("\n🔎 shell nslookup")
    try:
        out = subprocess.check_output(["nslookup", host], text=True)
        print(out)
    except FileNotFoundError:
        print("`nslookup` not available in this container.")
    except subprocess.CalledProcessError as e:
        print("nslookup error:", e.output)

def run_all():
    check_env()

    uri = os.getenv("NEO4J_URI", "")
    host = uri.split("://")[-1] if "://" in uri else uri
    host = host.strip()

    if host:
        if dns_lookup(host):
            port_check(host, 7687)
        shell_nslookup(host)
    else:
        print("\n⛔️ NEO4J_URI is empty; set it before running network tests.")

run_all()

🔎 Environment values
  NEO4J_URI     : 'neo4j+s://f21edcc2.databases.neo4j.io'
  NEO4J_USERNAME: 'neo4j'
  NEO4J_PASSWORD: <set>

🔎 DNS lookup for f21edcc2.databases.neo4j.io
[(<AddressFamily.AF_INET: 2>,
  <SocketKind.SOCK_STREAM: 1>,
  6,
  '',
  ('34.121.155.65', 7687)),
 (<AddressFamily.AF_INET: 2>,
  <SocketKind.SOCK_DGRAM: 2>,
  17,
  '',
  ('34.121.155.65', 7687)),
 (<AddressFamily.AF_INET: 2>,
  <SocketKind.SOCK_RAW: 3>,
  0,
  '',
  ('34.121.155.65', 7687))]
✅ Host resolved.

🔎 TCP connect to f21edcc2.databases.neo4j.io:7687
✅ TCP connection succeeded.

🔎 shell nslookup
Server:		127.0.0.11
Address:	127.0.0.11#53

Non-authoritative answer:
Name:	f21edcc2.databases.neo4j.io
Address: 34.121.155.65
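
The `run_all` function above derives the host by splitting the URI on `://`, which would keep a trailing `:port` if the URI had one. A possible alternative, sketched here with the standard library's `urllib.parse`, separates host and port explicitly. The hostnames in the example are placeholders and the helper name is hypothetical.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

DEFAULT_BOLT_PORT = 7687  # the port the checks above connect to


def host_and_port(uri: str) -> Tuple[Optional[str], int]:
    """Return (hostname, port) for a URI such as neo4j+s://host[:port]."""
    parsed = urlparse(uri.strip())
    # hostname is None when the URI has no '//' authority part
    return parsed.hostname, parsed.port or DEFAULT_BOLT_PORT


print(host_and_port("neo4j+s://example.databases.neo4j.io"))
print(host_and_port("bolt://localhost:7688"))
```

A `None` hostname signals a URI without a scheme, which can be reported before any DNS or TCP checks are attempted.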




### Connect to Neo4j

In [8]:
kg = Neo4jGraph(
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"]
)
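
Before loading any data, it can help to confirm that the connection actually answers queries. The sketch below assumes that the graph object's `query` method returns a list of dict rows, which is how `Neo4jGraph.query` is used elsewhere in this notebook. `connection_ok` and `FakeGraph` are hypothetical names, and the fake exists only so the example runs without a database.

```python
from typing import Any, Dict, List


def connection_ok(graph: Any) -> bool:
    """Run a trivial query and check that the expected row comes back."""
    rows: List[Dict[str, Any]] = graph.query("RETURN 1 AS ok")
    return bool(rows) and rows[0].get("ok") == 1


class FakeGraph:
    """Stand-in used to exercise the helper without a database."""

    def query(self, cypher: str, params: dict = None):
        return [{"ok": 1}]


print(connection_ok(FakeGraph()))
```

With the real connection, the call would be `connection_ok(kg)`.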


### Function - Extracting Metadata from research paper

- paper id
- file name
- title
- authors
- year
- venue
- keywords
- abstract
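
The paper id in the list above is derived from a hash in the code that follows. The sketch below mirrors that approach in a standalone form to show two properties: the id depends on the file name and a content sample, not on the directory, and it is 12 hexadecimal characters (48 bits) long. The function name and file paths are made up for the example.

```python
import hashlib
import os


def short_paper_id(file_path: str, content_sample: str = "") -> str:
    """Hash the file name plus up to 100 characters of content into a 12-char id."""
    base_name = os.path.basename(file_path)
    hash_input = f"{base_name}_{content_sample[:100]}" if content_sample else base_name
    return hashlib.md5(hash_input.encode()).hexdigest()[:12]


id_a = short_paper_id("/data/papers/example.json", "Some abstract text")
id_b = short_paper_id("/other/dir/example.json", "Some abstract text")
id_c = short_paper_id("/data/papers/example.json", "Different text")
print(len(id_a), id_a == id_b, id_a == id_c)
```

Because the id is deterministic, re-running the pipeline on the same file produces the same id, which is what lets `MERGE` avoid creating duplicate nodes.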




In [9]:
def extract_text_from_file(file_path):
    """Extract content from a PDF, TXT, or JSON file (JSON returns a dict with 'text' and 'json_data')"""
    # when the file is .pdf
    if file_path.lower().endswith('.pdf'):
        text = ""
        try:
            with pdfplumber.open(file_path) as pdf:
                for page in pdf.pages:
                    extracted = page.extract_text()
                    if extracted:
                        text += extracted + "\n"
        except Exception as e:
            print(f"Error extracting text from {file_path}: {e}")
            return ""
        return text

    # when the file is .txt
    elif file_path.lower().endswith('.txt'):
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                return file.read()
        except UnicodeDecodeError:
            try:
                with open(file_path, 'r', encoding='latin-1') as file:
                    return file.read()
            except Exception as e:
                print(f"Error reading {file_path}: {e}")
                return ""
    # when the file is .json
    elif file_path.lower().endswith('.json'):
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                json_data = json.load(file)
                # For JSON files, we'll return both the raw text (for chunking) and the structured data
                # The structured data will be used directly in extract_paper_metadata
                text_content = json.dumps(json_data, indent=2)  # Convert to string for text chunking
                return {"text": text_content, "json_data": json_data}
        except Exception as e:
            print(f"Error reading JSON {file_path}: {e}")
            return ""

    return ""

def generate_paper_id(file_path, content_sample=None):
    """Generate a unique paper ID based on file path and optional content sample"""
    base_name = os.path.basename(file_path)

    # Create a hash from the filename and optionally first 100 chars of content
    if content_sample:
        hash_input = f"{base_name}_{content_sample[:100]}"
    else:
        hash_input = base_name

    # Create a short hash as paper_id
    paper_id = hashlib.md5(hash_input.encode()).hexdigest()[:12]
    return paper_id

def extract_paper_metadata(content, file_path):
    """Extract metadata from paper content using pattern matching or direct JSON extraction"""
    # Check if we're dealing with JSON content (which would be a dict with 'text' and 'json_data')
    if isinstance(content, dict) and 'json_data' in content:
        # We have structured JSON data
        json_data = content['json_data']
        # Get text content for generating ID
        text_content = content['text']

        # Create a basic metadata dict with file info
        base_name = os.path.basename(file_path)
        paper_id = generate_paper_id(file_path, text_content[:500])

        # Initialize metadata with defaults
        metadata = {
            'paper_id': paper_id,
            'filename': base_name,
            'title': '',
            'authors': [],
            'year': '',
            'venue': '',
            'keywords': [],
            'abstract': ''
        }

        # Map JSON fields to metadata fields - adjust these based on JSON structure
        # These are common field names in academic paper JSON metadata
        json_field_mapping = {
            'title': ['title', 'paper_title', 'articleTitle', 'name'],
            'authors': ['authors', 'author', 'creator', 'creators', 'contributors'],
            'year': ['year', 'date', 'publicationDate', 'publication_date', 'publishedAt'],
            'venue': ['venue', 'journal', 'conference', 'publication', 'publisher'],
            'keywords': ['keywords', 'tags', 'subjects', 'categories', 'topics'],
            'abstract': ['abstract', 'summary', 'description']
        }

        # Extract metadata from JSON based on field mapping
        for meta_field, json_fields in json_field_mapping.items():
            for field in json_fields:
                if field in json_data:
                    value = json_data[field]
                    if value:
                        metadata[meta_field] = value
                        break

        # Handle special cases and formatting

        # Authors - ensure it's a list of strings
        if metadata['authors'] and not isinstance(metadata['authors'], list):
            # If authors is a string, try to split it
            if isinstance(metadata['authors'], str):
                # \band\b matches "and" only as a whole word, so names such as "Anderson" stay intact
                metadata['authors'] = [a.strip() for a in re.split(r'[;,]|\band\b', metadata['authors']) if a.strip()]

        # Extract year from date string if year is missing but we have a date
        if not metadata['year'] and any(field in json_data for field in ['date', 'publicationDate', 'publication_date']):
            for date_field in ['date', 'publicationDate', 'publication_date']:
                if date_field in json_data and json_data[date_field]:
                    # Try to extract year from date string using regex
                    date_str = str(json_data[date_field])
                    year_match = re.search(r'(?:19|20)\d{2}', date_str)
                    if year_match:
                        metadata['year'] = year_match.group(0)
                        break

        # Keywords - ensure it's a list of strings
        if metadata['keywords'] and not isinstance(metadata['keywords'], list):
            if isinstance(metadata['keywords'], str):
                metadata['keywords'] = [k.strip().lower() for k in re.split(r'[;,]', metadata['keywords']) if k.strip()]

        # If we have no keywords but have an abstract, extract keywords from abstract
        if not metadata['keywords'] and metadata['abstract']:
            try:
                # Initialize KeyBERT model
                kw_model = KeyBERT()

                # Extract keywords and keyphrases
                # Use ngram_range to extract multi-word phrases (1,3 means 1-3 word phrases)
                # Use top_n to limit the number of keywords/phrases returned
                keywords = kw_model.extract_keywords(
                    metadata['abstract'],
                    keyphrase_ngram_range=(1, 3),
                    stop_words='english',
                    use_mmr=True,      # Use Maximal Marginal Relevance for diversity
                    diversity=0.7,     # Higher diversity means more diverse phrases
                    top_n=10           # Get top 10 keyphrases
                )

                # Keywords will be a list of tuples (phrase, score)
                # Extract just the phrases
                metadata['keywords'] = [keyword[0] for keyword in keywords]

            except Exception:
                # Fallback to simpler extraction if KeyBERT fails (it is imported at the top, so ImportError cannot occur here)
                try:
                    import nltk
                    nltk.download('stopwords', quiet=True)
                    nltk.download('punkt', quiet=True)
                    from nltk.corpus import stopwords
                    from nltk.tokenize import word_tokenize
                    from nltk.util import ngrams

                    stop_words = set(stopwords.words('english'))
                    words = word_tokenize(metadata['abstract'].lower())
                    filtered_words = [w for w in words if w.isalpha() and w not in stop_words and len(w) > 3]

                    # Extract single words first
                    word_counts = Counter(filtered_words)

                    # Extract bigrams and trigrams
                    bigrams_list = list(ngrams(filtered_words, 2))
                    trigrams_list = list(ngrams(filtered_words, 3))

                    # Convert tuples to strings for counting
                    bigrams_strings = [' '.join(bigram) for bigram in bigrams_list]
                    trigrams_strings = [' '.join(trigram) for trigram in trigrams_list]

                    # Count n-grams
                    bigram_counts = Counter(bigrams_strings)
                    trigram_counts = Counter(trigrams_strings)

                    # Combine single words, bigrams and trigrams, prioritizing longer phrases
                    # This gives more weight to meaningful multi-word phrases
                    combined_keywords = []

                    # Add top trigrams first
                    combined_keywords.extend([term for term, count in trigram_counts.most_common(3)])

                    # Add top bigrams next
                    combined_keywords.extend([term for term, count in bigram_counts.most_common(3)])

                    # Fill the rest with top single words
                    remaining_slots = 10 - len(combined_keywords)
                    combined_keywords.extend([term for term, count in word_counts.most_common(remaining_slots)])

                    metadata['keywords'] = combined_keywords[:10]  # Limit to top 10
                except Exception:
                    # Super simple fallback
                    words = re.findall(r'\b[a-z]{4,}\b', metadata['abstract'].lower())
                    word_counts = {}
                    for word in words:
                        if word not in ['with', 'that', 'this', 'from', 'have', 'were']:
                            word_counts[word] = word_counts.get(word, 0) + 1
                    metadata['keywords'] = [w for w, c in sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:5]]

    else:
        # Regular text content processing for PDF/TXT files
        # Create a basic metadata dict with file info
        base_name = os.path.basename(file_path)
        paper_id = generate_paper_id(file_path, content[:500] if isinstance(content, str) else "")

        metadata = {
            'paper_id': paper_id,
            'filename': base_name,
            'title': '',
            'authors': [],
            'year': '',
            'venue': '',
            'keywords': [],
            'abstract': ''
        }

        # Skip if content is empty or not a string
        if not content or not isinstance(content, str):
            metadata['authors'] = 'Unknown'
            metadata['keywords'] = 'research'
            return metadata

        # Extract title - usually at the beginning and often the largest text
        # Look for the first substantial line that's not a header
        title_candidates = re.findall(r'(?:\n|^)([A-Z][^.\n]{10,150})\n', content[:2000])
        if title_candidates:
            metadata['title'] = title_candidates[0].strip()
        else:
            # Simpler approach - just take the first line with substantial text
            first_lines = content[:1000].split('\n')
            for line in first_lines:
                line = line.strip()
                if len(line) > 15 and not line.lower().startswith(('http', 'www', 'proceedings')):
                    metadata['title'] = line
                    break

        # Rest of the existing PDF/TXT extraction code...
        abstract_patterns = [
            # (?i) - Makes the pattern case-insensitive, so it will match "Abstract", "ABSTRACT", "abstract", etc.
            # abstract - Looks for the literal text "abstract"
            # [\s\.\:]+ - Matches one or more whitespace characters, periods, or colons that might follow the word "abstract"
            # ([^\n]+ - Starts capturing the abstract content, matching any characters except newlines
            # (?:\n[^\n]+){0,10}?) - A non-capturing group that allows the pattern to match up to 10 additional lines of text (each line started with a newline character)
            # (?:\n\n|\n[A-Z][a-z]) - Stops capturing when it encounters either: A double newline (indicating a paragraph break) or A newline followed by a capitalized word (likely the start of a new section)
            r'(?i)abstract[\s\.\:]+([^\n]+(?:\n[^\n]+){0,10}?)(?:\n\n|\n[A-Z][a-z])',
            r'(?i)abstract[\s\.\:]+([^\n]+(?:\n[^\n]+){0,10}?)(?:\n\s*keywords|\n\s*index terms)',
            r'(?i)abstract[\s\.\:]+([^\n]+(?:\n[^\n]+){0,10}?)(?:\n\s*introduction|\n\s*1\.)'
        ]

        for pattern in abstract_patterns:
            abstract_match = re.search(pattern, content[:5000], re.DOTALL)
            if abstract_match:
                metadata['abstract'] = abstract_match.group(1).strip()
                break

        author_patterns = [
            r'(?i)(?:authors?|by)[\s\.\:]+([^\n]+)(?:\n)',
            r'(?:^|\n)([A-Z][^,\n]+(?:,[^,\n]+){1,6})(?:\n)'
        ]

        for pattern in author_patterns:
            authors_match = re.search(pattern, content[:2000])
            if authors_match:
                authors_text = authors_match.group(1).strip()
                # Clean and split authors
                authors = re.split(r'(?:,\s*|;|\band\s+)', authors_text)
                metadata['authors'] = [a.strip() for a in authors if a.strip()]
                break

        # Extract year - look for a 4-digit year in the first few pages
        year_match = re.search(r'(?:19|20)\d{2}', content[:3000])
        if year_match:
            metadata['year'] = year_match.group(0)

        # Extract keywords
        # (?i) - Case insensitive matching
        # (?:keywords|index terms) - Either "keywords" or "index terms" (non-capturing group)
        # [\s\.\:]+ - Followed by whitespace, periods, or colons
        # ([^\n]+ - Capturing group starts with any characters except newline
        # (?:\n[^\n]+){0,3}?) - Optionally followed by up to 3 lines of text
        # (?:\n\n|\n[A-Z][a-z]) - Ending when either a double newline is found or a newline followed by a capitalized word (likely a new section)
        keyword_patterns = [
            r'(?i)(?:keywords|index terms)[\s\.\:]+([^\n]+(?:\n[^\n]+){0,3}?)(?:\n\n|\n[A-Z][a-z])',
            r'(?i)(?:keywords|index terms)[\s\.\:]+([^\n]+(?:\n[^\n]+){0,3}?)(?:\n\s*introduction|\n\s*1\.)'
        ]

        for pattern in keyword_patterns:
            keywords_match = re.search(pattern, content[:5000], re.DOTALL)
            if keywords_match:
                keywords_text = keywords_match.group(1).strip()
                # Clean and split keywords
                keywords = re.split(r'(?:,\s*|;|\n)', keywords_text)
                metadata['keywords'] = [k.strip().lower() for k in keywords if k.strip()]
                break

        # If we failed to extract meaningful keywords, try to extract from the abstract
        if not metadata['keywords'] and metadata['abstract']:
            try:
                import nltk  # imported inside the try so a missing NLTK falls through to the simple approach
                nltk.download('stopwords', quiet=True)
                nltk.download('punkt', quiet=True)
                from nltk.corpus import stopwords
                from nltk.tokenize import word_tokenize

                stop_words = set(stopwords.words('english'))
                words = word_tokenize(metadata['abstract'].lower())
                filtered_words = [w for w in words if w.isalpha() and w not in stop_words and len(w) > 3]

                # Get word frequency
                word_counts = Counter(filtered_words)

                # Extract top 5 words as keywords
                metadata['keywords'] = [word for word, count in word_counts.most_common(5)]
            except Exception:
                # If NLTK fails, use a simple approach
                words = re.findall(r'\b[a-z]{4,}\b', metadata['abstract'].lower())
                word_counts = {}
                for word in words:
                    if word not in ['with', 'that', 'this', 'from', 'have', 'were']:
                        word_counts[word] = word_counts.get(word, 0) + 1
                metadata['keywords'] = [w for w, c in sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:5]]

    # Convert lists to strings for Neo4j compatibility
    if metadata['authors']:
        if isinstance(metadata['authors'], list):
            metadata['authors'] = '; '.join(metadata['authors'])
    else:
        metadata['authors'] = 'Unknown'

    if metadata['keywords']:
        if isinstance(metadata['keywords'], list):
            metadata['keywords'] = '; '.join(metadata['keywords'])
    else:
        metadata['keywords'] = 'research'

    return metadata
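
Splitting an author string is easy to get subtly wrong: a bare `and` in the pattern also matches inside names such as "Alexander" or "Brandt". The following sketch shows one way to split on commas, semicolons, and the standalone word "and" (including an Oxford-comma ", and"). The regular expression and the function name are illustrative assumptions, not the notebook's exact patterns.

```python
import re
from typing import List

# Either a comma/semicolon (optionally followed by "and"), or a standalone "and"
_AUTHOR_SEPARATORS = re.compile(r"\s*[;,]\s*(?:and\b\s+)?|\s+\band\b\s+")


def split_authors(raw: str) -> List[str]:
    """Split an author string into individual names."""
    parts = _AUTHOR_SEPARATORS.split(raw)
    return [p.strip() for p in parts if p and p.strip()]


print(split_authors("Alexander Brandt, Sandra Anderson and Li Wei"))
print(split_authors("A. One; B. Two, and C. Three"))
```

The word boundaries (`\b`) are what keep "Anderson" and "Sandra" from being cut apart.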



### Function - Knowledge Graph Setup

1. Set up the knowledge graph schema
2. Create paper nodes
3. Create chunk nodes
4. Create author nodes
5. Create keyword nodes


In [10]:
def setup_kg_schema():
    """Create constraints and indexes in Neo4j"""
    # Create constraints to avoid duplicates
    kg.query("""
    CREATE CONSTRAINT unique_paper IF NOT EXISTS
        FOR (p:Paper) REQUIRE p.paperID IS UNIQUE
    """)

    kg.query("""
    CREATE CONSTRAINT unique_chunk IF NOT EXISTS
        FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
    """)

    kg.query("""
    CREATE CONSTRAINT unique_author IF NOT EXISTS
        FOR (a:Author) REQUIRE a.name IS UNIQUE
    """)

    kg.query("""
    CREATE CONSTRAINT unique_keyword IF NOT EXISTS
        FOR (k:Keyword) REQUIRE k.term IS UNIQUE
    """)

    # Create vector index for semantic search
    kg.query("""
    CREATE VECTOR INDEX research_paper_chunks IF NOT EXISTS
      FOR (c:Chunk) ON (c.textEmbedding)
      OPTIONS { indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
      }}
    """)

    print("Knowledge graph schema setup complete!")

def create_paper_node(metadata):
    """Create a Paper node in the graph"""
    cypher = """
    MERGE (p:Paper {paperID: $paperParam.paper_id})
      ON CREATE
        SET p.title = $paperParam.title,
            p.year = $paperParam.year,
            p.venue = $paperParam.venue,
            p.abstract = $paperParam.abstract,
            p.filename = $paperParam.filename
    """
    kg.query(cypher, params={'paperParam': metadata})

def create_chunk_nodes(chunks_with_metadata):
    """Create Chunk nodes and connect to Paper"""
    for chunk in chunks_with_metadata:
        # Create chunk node
        cypher = """
        MERGE (c:Chunk {chunkId: $chunkParam.chunkId})
          ON CREATE
            SET c.text = $chunkParam.text,
                c.paperTitle = $chunkParam.paperTitle,
                c.sectionSeqId = $chunkParam.sectionSeqId,
                c.paperID = $chunkParam.paperID
        """
        kg.query(cypher, params={'chunkParam': chunk})

        # Connect chunk to paper
        cypher = """
        MATCH (c:Chunk {chunkId: $chunkParam.chunkId}),
              (p:Paper {paperID: $chunkParam.paperID})
        MERGE (c)-[:PART_OF]->(p)
        """
        kg.query(cypher, params={'chunkParam': chunk})

def create_author_nodes(metadata):
    """Create Author nodes and connect to Paper"""
    if isinstance(metadata['authors'], str):
        authors = [a.strip() for a in re.split(r'[;,]', metadata['authors']) if a.strip()]
    else:
        authors = metadata['authors']

    for author in authors:
        if not author or author.lower() == 'unknown':
            continue

        # Create author node
        cypher = """
        MERGE (a:Author {name: $authorName})
        """
        kg.query(cypher, params={'authorName': author})

        # Connect author to paper
        cypher = """
        MATCH (a:Author {name: $authorName}),
              (p:Paper {paperID: $paperID})
        MERGE (a)-[:AUTHORED]->(p)
        """
        kg.query(cypher, params={'authorName': author, 'paperID': metadata['paper_id']})

def create_keyword_nodes(metadata):
    """Create Keyword nodes and connect to Paper"""
    if isinstance(metadata['keywords'], str):
        keywords = [k.strip().lower() for k in re.split(r'[;,]', metadata['keywords']) if k.strip()]
    else:
        keywords = metadata['keywords']

    for keyword in keywords:
        if not keyword:
            continue

        # Create keyword node
        cypher = """
        MERGE (k:Keyword {term: $keywordTerm})
        """
        kg.query(cypher, params={'keywordTerm': keyword})

        # Connect keyword to paper
        cypher = """
        MATCH (k:Keyword {term: $keywordTerm}),
              (p:Paper {paperID: $paperID})
        MERGE (p)-[:HAS_TOPIC]->(k)
        """
        kg.query(cypher, params={'keywordTerm': keyword, 'paperID': metadata['paper_id']})

# def create_chunk_sequence(paper_id):
#     """Link chunks in sequence within the same paper"""
#     cypher = """
#     MATCH (chunks:Chunk)
#     WHERE chunks.paperID = $paperID
#     WITH chunks ORDER BY chunks.sectionSeqId ASC
#     WITH collect(chunks) as paper_chunks
#     CALL apoc.nodes.link(paper_chunks, "NEXT", {avoidDuplicates: true})
#     RETURN size(paper_chunks)
#     """
#     result = kg.query(cypher, params={'paperID': paper_id})
#     chunk_count = result[0] if result else 0
#     return chunk_count
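
The author and keyword helpers above accept either a delimited string or a list. The string-splitting step can be tried on its own with a minimal sketch; the helper name `split_terms` and the sample names are hypothetical and not part of the notebook:

```python
import re

def split_terms(value, lowercase=False):
    """Split a ';' or ',' delimited string into clean terms; lists are cleaned item by item."""
    if isinstance(value, str):
        terms = [t.strip() for t in re.split(r'[;,]', value) if t.strip()]
    else:
        terms = [str(t).strip() for t in value if str(t).strip()]
    # Keywords are lowercased so that MERGE treats "LLM" and "llm" as one node
    return [t.lower() for t in terms] if lowercase else terms

print(split_terms("Ada Lovelace; Alan Turing, Grace Hopper"))
print(split_terms("LLM, Knowledge Graphs; RAG", lowercase=True))
```

Splitting on commas assumes author names are written "First Last"; a name stored as "Last, First" would be split into two separate authors.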

### Preprocess Papers

Papers whose `title` field is missing or set to 'Not Found' are excluded; the rest are copied to a filtered directory.

In [11]:
# PAPERS_DIR = '/content/drive/MyDrive/Colab_Notebooks/AI/arxiv_markdowns2'
PAPERS_DIR = '/content/drive/MyDrive/Colab_Notebooks/AI/arxiv_json'
OUTPUT_DIR = '/content/drive/MyDrive/Colab_Notebooks/AI/filtered_json_files'

# Clear the output directory first
if os.path.exists(OUTPUT_DIR):
    # Remove all files in the output directory
    for file in glob.glob(os.path.join(OUTPUT_DIR, "*")):
        os.remove(file)
else:
    # Create output directory if it doesn't exist
    os.makedirs(OUTPUT_DIR)

json_files = glob.glob(os.path.join(PAPERS_DIR, "*.json"))
print(f"Found {len(json_files)} JSON files in {PAPERS_DIR}")

valid_files_count = 0
invalid_files_count = 0

# Process each JSON file
for file_path in json_files:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        if 'title' in data and data['title'] != "Not Found":
            dest_path = os.path.join(OUTPUT_DIR, os.path.basename(file_path))
            shutil.copy2(file_path, dest_path)
            valid_files_count += 1
        else:
            invalid_files_count += 1

    except json.JSONDecodeError:
        print(f"Error: Could not decode JSON in file {file_path}")
    except Exception as e:
        print(f"Error processing file {file_path}: {str(e)}")

print(f"Valid files copied: {valid_files_count}")
print(f"Invalid files excluded: {invalid_files_count}")

# Check what's in the output directory
paper_files = glob.glob(os.path.join(OUTPUT_DIR, "*.json"))
print(f"Found {len(paper_files)} JSON files in {OUTPUT_DIR}")

for file in paper_files[:5]:
    print(f" - {os.path.basename(file)}")

Found 197 JSON files in /content/drive/MyDrive/Colab_Notebooks/AI/arxiv_json
Valid files copied: 197
Invalid files excluded: 0
Found 197 JSON files in /content/drive/MyDrive/Colab_Notebooks/AI/filtered_json_files
 - 2211.02069v2.json
 - 2211.04715v1.json
 - 1911.09661v1.json
 - 2102.02503v1.json
 - 2210.10723v2.json


### Function - Process papers and build the knowledge graph

In [12]:
def process_papers(paper_files, batch_size=5):
    """Process all papers and build the knowledge graph"""
    # Setup schema
    setup_kg_schema()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )

    # Process papers in batches
    for batch_start in range(0, len(paper_files), batch_size):
        batch_end = min(batch_start + batch_size, len(paper_files))
        batch = paper_files[batch_start:batch_end]

        print(f"Processing batch {batch_start//batch_size + 1} ({batch_start+1}-{batch_end} of {len(paper_files)} papers)")

        # Process each paper in the batch
        for i, file_path in enumerate(batch):
            try:
                print(f"  [{batch_start+i+1}/{len(paper_files)}] Processing: {os.path.basename(file_path)}")

                # Extract text content
                content = extract_text_from_file(file_path)
                if not content:
                    print(f"    Skipping - no content extracted")
                    continue

                # Extract metadata from content
                metadata = extract_paper_metadata(content, file_path)
                print(f"    Title: {metadata['title'][:50] + '...' if len(metadata['title']) > 50 else metadata['title']}")
                print(f"    Authors: {metadata['authors'][:50] + '...' if len(metadata['authors']) > 50 else metadata['authors']}")

                # Get text for chunking (handle both string and dict with 'text' key)
                text_for_chunking = content['text'] if isinstance(content, dict) and 'text' in content else content

                # Split into chunks
                chunks = text_splitter.split_text(text_for_chunking)
                print(f"    Split into {len(chunks)} chunks")

                # Create paper node
                create_paper_node(metadata)

                # Create chunks with metadata
                chunks_with_metadata = []
                for j, chunk in enumerate(chunks):
                    chunks_with_metadata.append({
                        'text': chunk,
                        'paperTitle': metadata['title'],
                        'sectionSeqId': j,
                        'paperID': metadata['paper_id'],
                        'chunkId': f"{metadata['paper_id']}_chunk_{j:04d}"
                    })

                # Create nodes and relationships
                create_chunk_nodes(chunks_with_metadata)
                create_author_nodes(metadata)
                create_keyword_nodes(metadata)
                # sequence_count = create_chunk_sequence(metadata['paper_id'])
                # print(f"    Created sequence of {sequence_count} chunks")

            except Exception as e:
                print(f"    Error processing paper {file_path}: {e}")

        print(f"Batch {batch_start//batch_size + 1} completed.")

    print("Paper processing complete!")
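
To see what `chunk_size=2000` and `chunk_overlap=200` mean in practice, here is a simplified sketch. It uses a plain sliding window over characters, which is an assumption made for illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, so its chunk boundaries will differ. The function names and sample values are hypothetical; the record fields mirror those built in `process_papers`.

```python
def sliding_window_chunks(text, chunk_size=2000, chunk_overlap=200):
    """Cut text into fixed-size windows; each window repeats the last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
        start += step
    return chunks

def chunk_records(paper_id, title, chunks):
    """Attach the same metadata fields that process_papers stores on each Chunk node."""
    return [
        {
            'text': chunk,
            'paperTitle': title,
            'sectionSeqId': j,
            'paperID': paper_id,
            'chunkId': f"{paper_id}_chunk_{j:04d}",
        }
        for j, chunk in enumerate(chunks)
    ]

chunks = sliding_window_chunks("x" * 5000)
records = chunk_records("2211.02069v2", "Example title", chunks)
print([len(c) for c in chunks])   # [2000, 2000, 1400]
print(records[-1]['chunkId'])     # 2211.02069v2_chunk_0002
```

The zero-padded `chunkId` keeps chunks of one paper in order when sorted as strings.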


### Function - Vector embedding

In [13]:
def generate_embeddings(batch_size=20):
    """Generate and store embeddings for chunks in batches"""
    embedding_model = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])

    # Get chunks without embeddings
    cypher = """
    MATCH (chunk:Chunk)
    WHERE chunk.textEmbedding IS NULL
    RETURN chunk.chunkId AS id, chunk.text AS text
    LIMIT $batchSize
    """
    chunks = kg.query(cypher, params={'batchSize': batch_size})

    if not chunks:
        print("No chunks need embeddings!")
        return 0

    print(f"Generating embeddings for {len(chunks)} chunks...")
    processed = 0

    for chunk in chunks:
        try:
            # Generate embedding
            embedding = embedding_model.embed_query(chunk['text'])

            # Update the chunk with embedding
            update_query = """
            MATCH (c:Chunk {chunkId: $id})
            SET c.textEmbedding = $embedding
            """
            kg.query(update_query, params={'id': chunk['id'], 'embedding': embedding})
            processed += 1

            # Show progress
            if processed % 5 == 0:
                print(f"  Processed {processed}/{len(chunks)} chunks")

        except Exception as e:
            print(f"  Error generating embedding for chunk {chunk['id']}: {str(e)[:100]}...")

    # Check if there are more to process
    remaining = kg.query("""
    MATCH (chunk:Chunk)
    WHERE chunk.textEmbedding IS NULL
    RETURN count(chunk) as count
    """)[0]['count']

    print(f"Processed batch of {processed} chunks. {remaining} chunks still need embeddings.")
    return remaining

### Function - QA Interface

In [14]:
def setup_qa_system():
    """Set up a QA system for semantic search"""
    # Define the retrieval query with context
    retrieval_query = """
    MATCH (node)-[:PART_OF]->(p:Paper)
    OPTIONAL MATCH (p)<-[:AUTHORED]-(author:Author)
    OPTIONAL MATCH (p)-[:HAS_TOPIC]->(keyword:Keyword)
    WITH node, score, p,
         collect(distinct author.name) as authors,
         collect(distinct keyword.term) as keywords
    RETURN "Paper: " + p.title +
           CASE WHEN p.year <> 'Unknown' THEN " (" + p.year + ")" ELSE "" END +
           "\nAuthors: " + CASE WHEN size(authors) > 0 THEN apoc.text.join(authors, ", ") ELSE "Unknown" END +
           "\nKeywords: " + CASE WHEN size(keywords) > 0 THEN apoc.text.join(keywords, ", ") ELSE "None" END +
           CASE WHEN p.abstract <> '' THEN "\nAbstract: " + left(p.abstract, 200) + "..." ELSE "" END +
           "\n\n" + node.text AS text,
        score,
        { source: p.filename, title: p.title, paperID: p.paperID} AS metadata
    """

    # Create vector store
    vector_store = Neo4jVector.from_existing_index(
        embedding=OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"]),
        url=os.environ["NEO4J_URI"],
        username=os.environ["NEO4J_USERNAME"],
        password=os.environ["NEO4J_PASSWORD"],
        index_name="research_paper_chunks",
        text_node_property="text",
        retrieval_query=retrieval_query,
    )

    # Create retriever and QA chain
    retriever = vector_store.as_retriever(search_kwargs={"k": 5})

    qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
        ChatOpenAI(temperature=0, api_key=os.environ["OPENAI_API_KEY"]),
        chain_type="stuff",
        retriever=retriever
    )

    return qa_chain
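
The retriever above asks the vector index for the `k` chunks whose embeddings are closest to the question embedding. A minimal in-memory sketch of that idea follows, assuming cosine similarity as the distance measure (the index definition is not shown in this section, so that is an assumption) and using made-up two-dimensional vectors in place of real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, chunk_vecs, k=5):
    """Return the k (chunk_id, score) pairs with the highest similarity to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Toy vectors standing in for chunk.textEmbedding values
chunk_vecs = {
    "paperA_chunk_0000": [1.0, 0.0],
    "paperA_chunk_0001": [0.0, 1.0],
    "paperB_chunk_0000": [1.0, 1.0],
}
results = top_k([1.0, 0.0], chunk_vecs, k=2)
print(results)
```

In the notebook this ranking happens inside Neo4j; the `retrieval_query` then enriches each matched chunk with its paper, authors, and keywords before the text reaches the LLM.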

### Function - Analysis
1. Find clusters of papers based on shared keywords
2. Find papers that share topics
3. Get statistics about the knowledge graph
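
The first two analyses can be illustrated without a database. The sketch below reproduces the same logic on a small hand-made mapping of paper IDs to keyword sets; the data and function names are hypothetical, and in the notebook the work is done by Cypher queries against Neo4j:

```python
from collections import defaultdict
from itertools import combinations

# Toy data: paper ID -> set of keywords
papers = {
    "P1": {"llm", "rag"},
    "P2": {"llm", "knowledge graph"},
    "P3": {"llm", "rag", "knowledge graph"},
}

def keyword_clusters(papers):
    """Group papers by keyword and keep keywords shared by more than one paper."""
    by_keyword = defaultdict(list)
    for paper_id in sorted(papers):
        for kw in papers[paper_id]:
            by_keyword[kw].append(paper_id)
    clusters = {kw: ids for kw, ids in by_keyword.items() if len(ids) > 1}
    # Largest clusters first, ties broken alphabetically by keyword
    return sorted(clusters.items(), key=lambda item: (-len(item[1]), item[0]))

def paper_connections(papers):
    """List each unordered pair of papers once, with the keywords they share."""
    pairs = []
    for a, b in combinations(sorted(papers), 2):
        shared = sorted(papers[a] & papers[b])
        if shared:
            pairs.append((a, b, shared))
    return sorted(pairs, key=lambda p: -len(p[2]))

print(keyword_clusters(papers))
print(paper_connections(papers))
```

Iterating over `combinations` of sorted IDs plays the same role as `WHERE p1.paperID < p2.paperID` in the Cypher version: each pair appears only once.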

In [15]:
def analyze_research_clusters():
    """Find clusters of papers based on shared keywords"""
    cypher = """
    MATCH (p:Paper)-[:HAS_TOPIC]->(k:Keyword)
    WITH k, collect(p) AS papers
    WHERE size(papers) > 1
    RETURN k.term AS topic,
           [p IN papers | p.title] AS paperTitles,
           size(papers) AS clusterSize
    ORDER BY clusterSize DESC
    LIMIT 10
    """
    clusters = kg.query(cypher)

    print("\nResearch clusters by topic:")
    for cluster in clusters:
        print(f"\nTopic: {cluster['topic']} ({cluster['clusterSize']} papers)")
        for i, title in enumerate(cluster['paperTitles']):
            print(f"  {i+1}. {title}")

    return clusters

def find_paper_connections():
    """Find papers that share topics"""
    cypher = """
    MATCH (p1:Paper)-[:HAS_TOPIC]->(k:Keyword)<-[:HAS_TOPIC]-(p2:Paper)
    WHERE p1.paperID < p2.paperID  // To avoid duplicate pairs
    WITH p1, p2, collect(k.term) as shared_topics
    RETURN p1.title AS paper1,
           p2.title AS paper2,
           shared_topics,
           size(shared_topics) AS connection_strength
    ORDER BY connection_strength DESC
    LIMIT 15
    """
    connections = kg.query(cypher)

    print("\nPaper connections through shared topics:")
    for conn in connections:
        print(f"\n{conn['paper1']} <--> {conn['paper2']}")
        print(f"  Shared topics ({conn['connection_strength']}): {', '.join(conn['shared_topics'])}")

    return connections

def get_kg_statistics():
    """Get statistics about the knowledge graph"""
    stats = {}

    # Count papers
    stats['paper_count'] = kg.query("""
    MATCH (p:Paper)
    RETURN count(p) as count
    """)[0]['count']

    # Count chunks
    stats['chunk_count'] = kg.query("""
    MATCH (c:Chunk)
    RETURN count(c) as count
    """)[0]['count']

    # Count authors
    stats['author_count'] = kg.query("""
    MATCH (a:Author)
    RETURN count(a) as count
    """)[0]['count']

    # Count keywords/topics
    stats['keyword_count'] = kg.query("""
    MATCH (k:Keyword)
    RETURN count(k) as count
    """)[0]['count']

    # Count relationships
    stats['relationship_count'] = kg.query("""
    MATCH ()-[r]->()
    RETURN count(r) as count
    """)[0]['count']

    print("\nKnowledge Graph Statistics:")
    print(f"Papers: {stats['paper_count']}")
    print(f"Chunks: {stats['chunk_count']}")
    print(f"Authors: {stats['author_count']}")
    print(f"Keywords/Topics: {stats['keyword_count']}")
    print(f"Relationships: {stats['relationship_count']}")

    return stats

In [16]:
def ask_research_question(qa_chain, question):
    """Ask a question about the research papers"""
    print(f"\nQuestion: {question}\n")
    try:
        answer = qa_chain({"question": question}, return_only_outputs=True)
        print(f"Answer: {answer['answer']}\n")
        print(f"Sources: {answer['sources']}")
        return answer
    except Exception as e:
        print(f"Error processing question: {e}")
        return None


## Ask questions

**The cell below takes several hours when run for the first time, so we use the `reuse_existing` flag**.

- On the first run, set `reuse_existing` to False to build the knowledge graph from scratch.
- On later runs, set `reuse_existing` to True to reuse the existing knowledge graph.
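
The effect of the flag can be summarized as a small decision function. This is a sketch of the control flow only; the function name `plan_build` and the step labels are hypothetical and do not appear in the notebook:

```python
def plan_build(reuse_existing, existing_papers):
    """Return the ordered steps the build cell would take for a given flag and graph state."""
    if reuse_existing and existing_papers > 0:
        steps = ["reuse"]
    else:
        steps = []
        if existing_papers > 0:
            steps.append("clear")   # wipe old relationships and nodes first
        steps.append("build")       # parse papers and create nodes
    steps.append("embed")           # embeddings are always checked, even on reuse
    return steps

print(plan_build(True, 197))    # ['reuse', 'embed']
print(plan_build(False, 197))   # ['clear', 'build', 'embed']
print(plan_build(True, 0))      # ['build', 'embed']
```

Note that setting the flag to True on an empty graph still triggers a full build, because there is nothing to reuse.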

In [17]:
# Suppress the CropBox warnings
logging.getLogger('pdfminer').setLevel(logging.ERROR)

# We set a flag reuse_existing: True to reuse existing data, False to build from scratch
reuse_existing = True

def execute_with_retry(query_func, max_retries=3, retry_delay=5):
    """Execute a database operation with retry logic"""
    retries = 0
    while retries < max_retries:
        try:
            return query_func()
        except (ServiceUnavailable, IncompleteCommit, OSError) as e:
            retries += 1
            if retries == max_retries:
                raise
            print(f"Connection error: {e}. Retrying in {retry_delay} seconds... (Attempt {retries}/{max_retries})")
            time.sleep(retry_delay)

# Check if we already have data in the graph
try:
    existing_papers = execute_with_retry(
        lambda: kg.query("""
            MATCH (p:Paper)
            RETURN count(p) as count
        """)[0]['count']
    )
    print(f"Found {existing_papers} existing papers in the knowledge graph.")

    if reuse_existing and existing_papers > 0:
        print("Reusing existing knowledge graph data.")
    else:
        # If not reusing or no existing data, build from scratch
        if existing_papers > 0:
            print("Clearing existing data from the knowledge graph...")
            # Break down the delete operation into smaller batches to avoid timeout
            def clear_graph():
                # First delete relationships in batches
                while True:
                    result = kg.query("""
                        MATCH ()-[r]->()
                        WITH r LIMIT 5000
                        DELETE r
                        RETURN count(r) as deleted
                    """)
                    deleted = result[0]['deleted']
                    print(f"Deleted {deleted} relationships...")
                    if deleted == 0:
                        break

                # Then delete nodes in batches
                while True:
                    result = kg.query("""
                        MATCH (n)
                        WITH n LIMIT 5000
                        DELETE n
                        RETURN count(n) as deleted
                    """)
                    deleted = result[0]['deleted']
                    print(f"Deleted {deleted} nodes...")
                    if deleted == 0:
                        break
                return True

            execute_with_retry(clear_graph)
            print("Knowledge graph cleared.")

        print("Building new knowledge graph...")
        execute_with_retry(lambda: process_papers(paper_files, batch_size=5))

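    print("Generating embeddings...")
    remaining = 1
    previous_remaining = None
    while remaining > 0:
        remaining = execute_with_retry(lambda: generate_embeddings(batch_size=20))
        # Stop if a pass made no progress; otherwise chunks that always fail would loop forever
        if remaining == previous_remaining:
            print(f"No progress in the last pass; {remaining} chunks still lack embeddings.")
            break
        previous_remaining = remaining
    else:
        print("Embedding generation complete.")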

    # Display knowledge graph statistics
    stats = execute_with_retry(get_kg_statistics)
    print(f"Knowledge graph statistics: {stats}")

    print("Setting up QA system...")
    qa_chain = execute_with_retry(setup_qa_system)

    print("\nAnalyzing research clusters...")
    execute_with_retry(analyze_research_clusters)

    print("\nFinding paper connections...")
    execute_with_retry(find_paper_connections)

    print("\nReady to answer questions about the research papers!")

except Exception as e:
    print(f"Error: {e}")
    print("Tip: If you're experiencing connection issues, try the following:")
    print("1. Check your Neo4j database connection settings")
    print("2. Make sure your Neo4j instance has enough memory")
    print("3. Break large operations into smaller batches")
    print("4. Check if your cloud provider (if using one) is experiencing issues")

Found 197 existing papers in the knowledge graph.
Reusing existing knowledge graph data.
Generating embeddings...
No chunks need embeddings!
Embedding generation complete.

Knowledge Graph Statistics:
Papers: 197
Chunks: 9731
Authors: 1951
Keywords/Topics: 1841
Relationships: 14005
Knowledge graph statistics: {'paper_count': 197, 'chunk_count': 9731, 'author_count': 1951, 'keyword_count': 1841, 'relationship_count': 14005}
Setting up QA system...

Analyzing research clusters...

Research clusters by topic:

Topic: large language models (44 papers)
  1. Enhancing Advanced Visual Reasoning Ability of Large Language Models
  2. The structure of the token space for large language models
  3. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  4. Explaining Large Language Model-Based Neural Semantic Parsers (Student Abstract)
  5. Selection Bias Induced Spurious Correlations in Large Language Models
  6. ChatGPT and Other Large Language Models as Evolutionary Engines for

In [18]:
ask_research_question(qa_chain, "What methodologies are commonly used in these papers?")


Question: What methodologies are commonly used in these papers?

Answer: The methodologies commonly used in these papers include natural language processing techniques, retrieval-augmented generation, vision-language modeling, alignment of radiology reports, and large language models for news recommender systems.


Sources: 2411.01195v1.json, 2408.12141v1.json, 2310.12321v1.json, 2502.09797v2.json, 2411.18583v1.json


{'answer': 'The methodologies commonly used in these papers include natural language processing techniques, retrieval-augmented generation, vision-language modeling, alignment of radiology reports, and large language models for news recommender systems.\n',
 'sources': '2411.01195v1.json, 2408.12141v1.json, 2310.12321v1.json, 2502.09797v2.json, 2411.18583v1.json'}

In [19]:
ask_research_question(qa_chain, "Summarize the key findings across these papers.")


Question: Summarize the key findings across these papers.

Answer: Key findings across these papers include the use of Natural Language Processing techniques and retrieval-augmented generation to automate literature reviews, the improvement of zero-shot reasoning abilities of large language models, the observation of biases in case judgment summaries generated by legal datasets and large language models, and the importance of transfer learning for finetuning large language models.


Sources: 2411.18583v1.json, 2310.03710v2.json, 2312.00554v1.json, 2411.01195v1.json


{'answer': 'Key findings across these papers include the use of Natural Language Processing techniques and retrieval-augmented generation to automate literature reviews, the improvement of zero-shot reasoning abilities of large language models, the observation of biases in case judgment summaries generated by legal datasets and large language models, and the importance of transfer learning for finetuning large language models.\n',
 'sources': '2411.18583v1.json, 2310.03710v2.json, 2312.00554v1.json, 2411.01195v1.json'}

In [20]:
ask_research_question(qa_chain, "What are the most influential authors in this field?")


Question: What are the most influential authors in this field?

Answer: The most influential authors in this field are Gerd Gigerenzer, Henry Brighton, E Gold, Mark, Alison Gopnik, Henry M Wellman, Dirk Groeneveld, Iz Beltagy, Pete Walsh, and others.


Sources: 2406.13138v2.json


{'answer': 'The most influential authors in this field are Gerd Gigerenzer, Henry Brighton, E Gold, Mark, Alison Gopnik, Henry M Wellman, Dirk Groeneveld, Iz Beltagy, Pete Walsh, and others.\n',
 'sources': '2406.13138v2.json'}

In [21]:
ask_research_question(qa_chain, "What are the limitations mentioned in these papers?")


Question: What are the limitations mentioned in these papers?

Answer: The limitations mentioned in these papers include the dynamic nature of code leading to changes over time, lack of transparency in training methodologies, and non-reproducible papers due to companies not being transparent about their training methods.


Sources: 2411.01195v1.json, 2403.15230v1.json


{'answer': 'The limitations mentioned in these papers include the dynamic nature of code leading to changes over time, lack of transparency in training methodologies, and non-reproducible papers due to companies not being transparent about their training methods.\n',
 'sources': '2411.01195v1.json, 2403.15230v1.json'}

In [22]:
ask_research_question(qa_chain, "What future research directions are suggested?")


Question: What future research directions are suggested?

Answer: Future research directions suggested include working on developing research directions beyond the scope of the current paper, exploring the impact of large language models on how work is carried out, and investigating the potential impacts of large language models on various fields such as business process management and robotics education.


Sources: 2304.04309v1.json, 2402.06116v1.json


{'answer': 'Future research directions suggested include working on developing research directions beyond the scope of the current paper, exploring the impact of large language models on how work is carried out, and investigating the potential impacts of large language models on various fields such as business process management and robotics education.\n',
 'sources': '2304.04309v1.json, 2402.06116v1.json'}

In [23]:
ask_research_question(qa_chain, "What large language model are used in these papers")


Question: What large language model are used in these papers

Answer: The large language models used in these papers include GPT-2, OpenAI's GPT series, and other models mentioned in the references.


Sources: 2405.15628v1.json, 2503.01887v1.json, 2310.03710v2.json, 2307.05782v2.json


{'answer': "The large language models used in these papers include GPT-2, OpenAI's GPT series, and other models mentioned in the references.\n",
 'sources': '2405.15628v1.json, 2503.01887v1.json, 2310.03710v2.json, 2307.05782v2.json'}

# 💡 Ask your own question!

In [24]:
# ask_research_question(qa_chain, "Your question")

### Visualize the knowledge graph

In [25]:
# Visualize the knowledge graph
from neo4j import GraphDatabase
from yfiles_jupyter_graphs import GraphWidget

from google.colab import output
output.enable_custom_widget_manager()

In [26]:
# # The default_cypher query below looks for:
# # (n) - A node (any node) which we're calling "n"
# # -[r]-> - A relationship (any relationship) which we're calling "r", with the arrow indicating direction
# # (m) - Another node which we're calling "m"
# # RETURN n,r,m - Return all three elements: the starting node, the relationship, and the ending node
# # LIMIT 50 - Only return up to 50 results, we can also set the limit to 5000.
# default_cypher = "MATCH (n)-[r]->(m) RETURN n,r,m LIMIT 50"

# def showGraph(cypher: str = default_cypher):
#     # create a neo4j session to run queries
#     driver = GraphDatabase.driver(
#         uri=os.environ["NEO4J_URI"],
#         auth=(os.environ["NEO4J_USERNAME"],
#               os.environ["NEO4J_PASSWORD"]))
#     session = driver.session()
#     widget = GraphWidget(graph=session.run(cypher).graph())
#     widget.node_label_mapping = 'id'
#     display(widget)
#     return widget

# showGraph()

In [27]:

def showGraph(keyword=None, limit=500):
    """
    Visualize the knowledge graph with keywords/topics as central nodes,
    each linked to the papers that cover them.

    Parameters:
    - keyword: Optional string to filter for a specific keyword
    - limit: Maximum number of result rows (matched paths) to return
    """
    # create a neo4j session to run queries
    driver = GraphDatabase.driver(
        uri=os.environ["NEO4J_URI"],
        auth=(os.environ["NEO4J_USERNAME"],
              os.environ["NEO4J_PASSWORD"]))
    session = driver.session()

    if keyword:
        # Query for a specific keyword and its connected papers
        cypher = f"""
        MATCH (k:Keyword {{term: $keyword}})<-[r1:HAS_TOPIC]-(p:Paper)
        OPTIONAL MATCH (p)-[r2:HAS_TOPIC]->(k2:Keyword)
        WHERE k <> k2
        RETURN k, r1, p, r2, k2
        LIMIT {limit}
        """
        params = {"keyword": keyword}
    else:
        # Query that puts keywords at the center with connected papers
        cypher = f"""
        MATCH (k:Keyword)<-[r1:HAS_TOPIC]-(p:Paper)
        RETURN k, r1, p
        LIMIT {limit}
        """
        params = {}

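    # Run the query and build the widget; .graph() buffers the full result,
    # so the session and driver can be closed immediately afterwards
    widget = GraphWidget(graph=session.run(cypher, params).graph())
    session.close()
    driver.close()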

    # Configure the visualization
    widget.node_label_mapping = 'id'

    # Make keyword nodes more prominent
    widget.node_style_mapping = lambda node: {
        'fill': '#4CAF50' if 'Keyword' in node.labels else '#2196F3',
        'shape': 'ellipse' if 'Keyword' in node.labels else 'rectangle',
        'size': 50 if 'Keyword' in node.labels else 35
    }

    # Edge styles based on relationship type
    widget.edge_style_mapping = lambda edge: {
        'stroke': '#FF5722' if edge.type == 'HAS_TOPIC' else '#9C27B0',
        'stroke-width': 3 if edge.type == 'HAS_TOPIC' else 1
    }

    display(widget)
    return widget


In [37]:
# Show a specific keyword
showGraph(keyword="large language models")
# showGraph(keyword="gpt")

GraphWidget(layout=Layout(height='800px', width='100%'))


### Writing Cypher with an LLM

Print the schema of the knowledge graph

In [29]:
kg.refresh_schema()
print(textwrap.fill(kg.schema, 60))

Node properties: Paper {paperID: STRING, title: STRING,
year: STRING, venue: STRING, abstract: STRING, filename:
STRING} Chunk {paperID: STRING, chunkId: STRING,
textEmbedding: LIST, text: STRING, paperTitle: STRING,
sectionSeqId: INTEGER} Author {name: STRING} Keyword {term:
STRING} Relationship properties:  The relationships:
(:Paper)-[:HAS_TOPIC]->(:Keyword)
(:Chunk)-[:PART_OF]->(:Paper)
(:Author)-[:AUTHORED]->(:Paper)


In [30]:
CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to
query a graph database.
Instructions:
Use only the provided relationship types and properties in the
schema. Do not use any other relationship types or properties that
are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than
for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Examples: Here are a few examples of generated Cypher
statements for particular questions:

# Find papers that share topics
MATCH (p1:Paper)-[:HAS_TOPIC]->(k:Keyword)<-[:HAS_TOPIC]-(p2:Paper)
    WHERE p1.paperID < p2.paperID  // To avoid duplicate pairs
    WITH p1, p2, collect(k.term) as shared_topics
    RETURN p1.title AS paper1,
           p2.title AS paper2,
           shared_topics,
           size(shared_topics) AS connection_strength
    ORDER BY connection_strength DESC
The question is:
{question}"""

In [31]:
from langchain.chains import GraphCypherQAChain
from langchain.prompts.prompt import PromptTemplate
CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"],
    template=CYPHER_GENERATION_TEMPLATE
)
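
The prompt template has two placeholders, `{schema}` and `{question}`, which are filled in each time the chain runs. The sketch below shows that substitution with plain `str.format` on a shortened stand-in template; the template text, function name, and sample question are assumptions for illustration, and the chain itself goes through `PromptTemplate` instead:

```python
# Shortened stand-in for CYPHER_GENERATION_TEMPLATE (illustration only)
TEMPLATE = """Schema:
{schema}
The question is:
{question}"""

def render_prompt(schema, question):
    """Fill the two placeholders the way the chain does before calling the LLM."""
    return TEMPLATE.format(schema=schema, question=question)

prompt = render_prompt(
    schema="(:Paper)-[:HAS_TOPIC]->(:Keyword)",
    question="Which keywords are attached to the most papers?",
)
print(prompt)
```

Because the schema is injected at run time, calling `kg.refresh_schema()` after changing the graph keeps the generated Cypher aligned with the current labels and relationship types.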

In [32]:
cypherChain = GraphCypherQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=kg,
    verbose=True,
    cypher_prompt=CYPHER_GENERATION_PROMPT,
    allow_dangerous_requests=True
)

In [33]:
def prettyCypherChain(question: str) -> None:
    """Run the Cypher QA chain on a question and print the wrapped answer."""
    response = cypherChain.run(question)
    print(textwrap.fill(response, 60))

In [34]:
prettyCypherChain("Please give me the most popular topics of these paper?")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Paper)-[:HAS_TOPIC]->(k:Keyword)
RETURN k.term AS topic, COUNT(p) AS popularity
ORDER BY popularity DESC
Full Context:
[{'topic': 'large language models', 'popularity': 44}, {'topic': 'large', 'popularity': 10}, {'topic': 'large language', 'popularity': 9}, {'topic': 'large language model', 'popularity': 8}, {'topic': 'language models llms', 'popularity': 5}, {'topic': 'state', 'popularity': 4}, {'topic': 'including', 'popularity': 4}, {'topic': 'language', 'popularity': 4}, {'topic': 'language models', 'popularity': 3}, {'topic': 'gpt', 'popularity': 3}]

> Finished chain.
The most popular topics of these papers are large language
models, large, and large language.


# 💡 Ask your own question!

In [35]:
# prettyCypherChain("Your question")