# Academic Search System

- Author: [Heeah Kim](https://github.com/yellowGangneng)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)


[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/19-Cookbook/07-AcademicQASystemUsingGraphRAG.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/19-Cookbook/07-AcademicQASystemUsingGraphRAG.ipynb)

## Overview

This tutorial loads an open academic publication dataset, *OpenAlex*, into the graph database *Neo4j*.

We then use an LLM to generate <U>Cypher queries</U> (Cypher is the query language of the graph DB), run them,
and use the data they return to answer questions.
The result is an *Academic Search System*.

![academic-search-system]() # TODO: add the academic search system image

Before we dive into the tutorial, let's understand what **GraphRAG** is and why we should use it!

**GraphRAG** is the RAG we already know well, extended so that the search path includes <U>not only vectors but also a **knowledge graph**.</U>

So what do we gain by using **GraphRAG**?
There are three main advantages.

1. You can obtain more accurate and higher quality results.
    - According to Microsoft, using **GraphRAG** allowed them to obtain more relevant contexts, which led to better answers. It also made it easier to trace the grounds for those answers. 
    - Additionally, it required 26-97% fewer tokens, resulting in cost savings and scalability benefits.

2. It enhances data comprehension.
    - When looking at vectors represented by long lists of numbers, it is nearly impossible for a human to understand them conceptually or intuitively.
    <br>![vector-data]() # TODO: add the vector data image
    <br>Graphs, by contrast, are highly intuitive. They make it much easier to understand the relationships between data.
    <br>![graph-data]() # TODO: add the graph data image
    <br>By exploring such intuitive graphs, you can gain new insights.

3. Management becomes easier in terms of tracking, explaining, and access control.
    - Using graphs, you can trace why certain data was selected or why errors occurred. This traceability can be used to explain the results.
    - Additionally, you can assign data permissions within the knowledge graph, enhancing security and privacy protection.

Knowing what **GraphRAG** is makes you want to use it even more, doesn't it?
Now, let's embark on creating an *Academic Search System* together!

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)

### References

- [Create a Neo4j GraphRAG Workflow Using LangChain and LangGraph](https://neo4j.com/developer-blog/neo4j-graphrag-workflow-langchain-langgraph/)
- [The GraphRAG Manifesto: Adding Knowledge to GenAI](https://neo4j.com/blog/graphrag-manifesto/)
- [Graph-Based-Literature-Review-Tool](https://github.com/vtmike2015/Graph-Based-Literature-Review-Tool/tree/main)
- [GraphRAG : Neo4j DB와 LangChain 결합을 통한 질의응답 구현하기 (Kaggle CSV 데이터 적용하기)](https://uoahvu.tistory.com/entry/GraphRAG-Neo4j-DB%EC%99%80-LangChain-%EA%B2%B0%ED%95%A9%EC%9D%84-%ED%86%B5%ED%95%9C-%EC%A7%88%EC%9D%98%EC%9D%91%EB%8B%B5-%EA%B5%AC%ED%98%84%ED%95%98%EA%B8%B0-Kaggle-CSV-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EC%A0%81%EC%9A%A9%ED%95%98%EA%B8%B0)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [9]:
%%capture --no-stderr
!pip install langchain-opentutorial

In [10]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain-anthropic",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)

In [11]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Academic Search System",  # Set this to match the tutorial title
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [20]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True

#### Temporary workaround until these packages are added

In [13]:
!pip install langchain-neo4j



In [14]:
!pip install pyalex



In [89]:
from pyalex import Works
import json
from neo4j import GraphDatabase
import os
import ast

In [86]:
uri = os.environ["NEO4J_URL"]
username = os.environ["NEO4J_USERNAME"]
password = os.environ["NEO4J_PASSWORD"]
driver = GraphDatabase.driver(uri, auth=(username, password))
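
The connection cell above reads `NEO4J_URL`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` from the environment and raises a `KeyError` if any of them is missing. A minimal sketch of a pre-flight check follows; the helper name `missing_env_vars` is hypothetical and not part of any library.

```python
import os

# Variable names taken from the connection cell in this tutorial
REQUIRED_VARS = ["NEO4J_URL", "NEO4J_USERNAME", "NEO4J_PASSWORD"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Set these variables before connecting: " + ", ".join(missing))
```

Running this before creating the driver turns a bare `KeyError` into a readable list of what still needs to be set, for example in the `.env` file loaded earlier.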

In [88]:
os.getcwd()

'D:\\YellowGangneng\\LangChain-OpenTutorial\\19-Cookbook\\04-GraphRAG'

Prerequisites
- Start a Neo4j DB with Docker


Use pyalex to download Artificial Intelligence literature metadata as JSON files

Nodes and properties

- Works
  - cited_by_count
  - display_name
  - is_paratext
  - publication_year
  - title
  - type
  - url
  - authorships
  - topics
- Authorship
  - affiliations
  - author
  - author_position
- Author
  - affiliations
  - cited_by_count
  - works_count
  - title 
- Topic
  - description
  - display_name
  - domain
  - field
  - keywords
  - works_count
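
The node list above maps onto the nested JSON that OpenAlex returns for each work. As a rough sketch of how one record could be split into flat `Work` and `Author` property maps, assuming the field names in the list and the `authorships[].author.display_name` path used elsewhere in this tutorial; the function name and the sample record are made up for illustration.

```python
# Scalar Work properties from the list above (nested fields are handled separately)
WORK_PROPERTIES = [
    "cited_by_count",
    "display_name",
    "is_paratext",
    "publication_year",
    "title",
    "type",
]


def split_work_record(work):
    """Split one work dict into a flat Work map and a list of flat Author maps."""
    work_node = {"id": work.get("id")}
    for key in WORK_PROPERTIES:
        work_node[key] = work.get(key)

    authors = []
    for authorship in work.get("authorships") or []:
        author = authorship.get("author") or {}
        authors.append(
            {
                "id": author.get("id"),
                "display_name": author.get("display_name"),
                "author_position": authorship.get("author_position"),
            }
        )
    return work_node, authors


# Hypothetical record, not real OpenAlex data
sample = {
    "id": "W0000000000",
    "title": "Example Work",
    "publication_year": 2020,
    "authorships": [
        {"author_position": "first", "author": {"id": "A0000000000", "display_name": "Example Author"}}
    ],
}
work_node, authors = split_work_record(sample)
print(work_node["title"], [a["display_name"] for a in authors])
```

Keeping works and authors as separate flat maps mirrors how they end up as separate node labels in the graph.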

In [99]:
concept_id = "C154945302"  # OpenAlex concept ID for Artificial Intelligence

# Query OpenAlex for works tagged with this concept.
# per_page=1 with n_max=10 yields ten pages of one work each.
pager = (
    Works()
    .filter(concept={"id": concept_id})
    .paginate(per_page=1, n_max=10)
)

# Print the first author of each work as a quick sanity check
for page in pager:
    print(page[0]["authorships"][0]["author"]["display_name"])

# To save each page as a JSON file for the import steps below,
# replace the loop above with this one:
# page_count = 1
# for page in pager:
#     file = "./data/" + concept_id + "_Page_" + str(page_count) + ".json"
#     with open(file, "w") as out_file:
#         json.dump(page, out_file, indent=6)
#     print(
#         "Now Downloading Page " + str(page_count) + " For Concept ID " + concept_id
#     )
#     page_count += 1

Kaiming He
Leo Breiman
Yoav Benjamini
Stephen F. Altschul
Stephen F. Altschul
Karen Simonyan
Icek Ajzen
Olaf Ronneberger
Alex Krizhevsky
Yann LeCun
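
The file listing later in this tutorial shows pages saved as `<concept_id>_Page_<n>.json`. A minimal sketch of that download step as a reusable function follows; `save_pages` is a hypothetical helper, and it accepts any iterable of pages so it can be tried without calling the OpenAlex API.

```python
import json
from pathlib import Path


def save_pages(pages, concept_id, out_dir="./data"):
    """Write each page (a list of work dicts) to <concept_id>_Page_<n>.json.

    Returns the list of paths written, in page order.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for page_count, page in enumerate(pages, start=1):
        file = out_path / f"{concept_id}_Page_{page_count}.json"
        with open(file, "w", encoding="utf-8") as out_file:
            json.dump(page, out_file, indent=6)
        written.append(file)
    return written
```

Passing a pyalex pager as `pages` would write one file per page; `enumerate` replaces the manual page counter, and the `with` block closes each file even if serialization fails.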


Create indexes on key properties

In [23]:
driver.execute_query(
    "CREATE INDEX Institutions IF NOT EXISTS FOR \
    (i:Institutions) ON (i.id)"
)
driver.execute_query(
    "CREATE INDEX Concept IF NOT EXISTS FOR \
    (i:Concept) ON (i.id)"
)
driver.execute_query(
    "CREATE INDEX Work_ID IF NOT EXISTS FOR \
    (i:Work) ON (i.id)"
)
driver.execute_query(
    "CREATE INDEX Author IF NOT EXISTS FOR \
    (i:Author) ON (i.id)"
)
driver.execute_query(
    "CREATE INDEX Source IF NOT EXISTS FOR \
    (i:Source) ON (i.id)"
)


EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0x000001F39C2C68D0>, keys=[])
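
The five index statements above differ only in the index name and the node label, so they can also be generated from a small mapping. This is a sketch under that assumption; `index_statements` is a hypothetical helper, and the `execute_query` call is left commented out so the block runs without a database.

```python
# Index name -> node label, copied from the statements above
INDEXES = {
    "Institutions": "Institutions",
    "Concept": "Concept",
    "Work_ID": "Work",
    "Author": "Author",
    "Source": "Source",
}


def index_statements(indexes):
    """Build one CREATE INDEX ... IF NOT EXISTS statement per (name, label) pair."""
    return [
        f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.id)"
        for name, label in indexes.items()
    ]


for statement in index_statements(INDEXES):
    print(statement)
    # driver.execute_query(statement)  # uncomment once `driver` is connected
```

Because every statement uses `IF NOT EXISTS`, rerunning the loop is safe when the indexes are already in place.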

In [None]:
os.getcwd()

Move to the directory where the pyalex data is stored

In [44]:
os.chdir(os.getcwd() + "/data")

In [49]:
from glob import glob

directory_list = glob("./*.json")
directory_list = [directory[2:] for directory in directory_list]
directory_list

['C154945302_Page_1.json',
 'C154945302_Page_10.json',
 'C154945302_Page_100.json',
 'C154945302_Page_11.json',
 'C154945302_Page_12.json',
 'C154945302_Page_13.json',
 'C154945302_Page_14.json',
 'C154945302_Page_15.json',
 'C154945302_Page_16.json',
 'C154945302_Page_17.json',
 'C154945302_Page_18.json',
 'C154945302_Page_19.json',
 'C154945302_Page_2.json',
 'C154945302_Page_20.json',
 'C154945302_Page_21.json',
 'C154945302_Page_22.json',
 'C154945302_Page_23.json',
 'C154945302_Page_24.json',
 'C154945302_Page_25.json',
 'C154945302_Page_26.json',
 'C154945302_Page_27.json',
 'C154945302_Page_28.json',
 'C154945302_Page_29.json',
 'C154945302_Page_3.json',
 'C154945302_Page_30.json',
 'C154945302_Page_31.json',
 'C154945302_Page_32.json',
 'C154945302_Page_33.json',
 'C154945302_Page_34.json',
 'C154945302_Page_35.json',
 'C154945302_Page_36.json',
 'C154945302_Page_37.json',
 'C154945302_Page_38.json',
 'C154945302_Page_39.json',
 'C154945302_Page_4.json',
 'C154945302_Page_40.js
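
The listing above is in string order, so `Page_10` comes before `Page_2`. Since the import uses `MERGE`, the order should not change the resulting graph, but progress logs are easier to follow in page order. A sketch of a numeric sort key follows; `page_number` is a hypothetical helper that assumes the `_Page_<n>.json` naming shown above.

```python
import re


def page_number(filename):
    """Extract the page number from '<concept_id>_Page_<n>.json'; -1 if absent."""
    match = re.search(r"_Page_(\d+)\.json$", filename)
    return int(match.group(1)) if match else -1


files = [
    "C154945302_Page_1.json",
    "C154945302_Page_10.json",
    "C154945302_Page_2.json",
]
print(sorted(files, key=page_number))
```

Applying `sorted(directory_list, key=page_number)` before the import loop would process the pages in numeric order.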

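Load the downloaded JSON files into the graph DB

The import cells below rely on the following Neo4j Cypher constructs:

- `CALL ~ YIELD`: calls a procedure and selects which of its output columns to keep.
- `apoc.periodic.iterate`: runs a second statement once for every row returned by a first statement, committing in batches.
- `apoc.load.json`: reads JSON from a file or URL and yields each element as a map named `value`.
- `MERGE ~ SET`: `MERGE` matches an existing node or creates it if absent; `SET` then writes its properties.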
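
To see how these four pieces fit together, here is a minimal sketch that assembles an import statement as a plain string. The helper name `build_import_query` and its parameters are hypothetical; the Cypher it produces follows the same shape as the import cell in this tutorial, and nothing is sent to a database.

```python
def build_import_query(file, label, properties, batch_size=100):
    """Assemble an apoc.periodic.iterate call that loads `file` into `label` nodes."""
    # Outer statement: apoc.load.json streams each JSON element as `value`
    load = f"CALL apoc.load.json('file:///{file}') YIELD value"

    # Inner statement: MERGE finds or creates the node, SET fills in its properties
    assignments = ", ".join(
        f"n.{prop} = coalesce(value.{prop}, '')" for prop in properties
    )
    action = f"MERGE (n:{label} {{id: value.id}}) SET {assignments}"

    # apoc.periodic.iterate runs `action` for every row of `load`, in batches
    return (
        f'CALL apoc.periodic.iterate("{load}", "{action}", '
        f"{{batchSize: {batch_size}, parallel: true, retries: 2}}) "
        "YIELD batches, total, operations"
    )


print(build_import_query("C154945302_Page_1.json", "Work", ["title", "type"]))
```

Printing the statement before executing it is a cheap way to check the quoting, since the two inner statements are double-quoted strings nested inside the outer call.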

In [84]:
for file in directory_list:
    print("File being imported: " + file)
    work_node_creation = (
        "CALL apoc.periodic.iterate(\"CALL apoc.load.json('file:///"
        + file
        + "') YIELD value\",\"MERGE (w:Work {id: value.id}) \
        SET w.cited_by_count = coalesce(value.cited_by_count, ''), \
        w.display_name = coalesce(value.display_name, ''), \
        w.is_paratext = coalesce(value.is_paratext, ''), \
        w.language = coalesce(value.language, ''), \
        w.publication_date = coalesce(value.publication_date, ''), \
        w.publication_year = coalesce(value.publication_year, ''), \
        w.title = coalesce(value.title, ''), \
        w.type = coalesce(value.type, ''), \
        w.is_oa = coalesce(value.is_oa, ''), \
        w.license = coalesce(value.license, ''), \
        w.url = coalesce(value.url, '')\",{ batchSize: 100, \
        parallel: true, retries: 2} ) \
        YIELD batches, total, operations"
    )

    # Uncomment the print command below to view the raw Cypher script used by Neo4j
    # print(work_node_creation)

    record, summary, keys = driver.execute_query(work_node_creation)
    print("Operations executed during file import - " + str(record[0][2]))
    print("File - " + file + " import complete")

print("All works imported")


In [None]:
for file in directory_list:
    print("File being imported: " + file)
    # Authors are nested under each work's `authorships` list, so unwind that list,
    # merge one Author node per author ID, and link it to its work with WROTE.
    # parallel is false because concurrent batches touching the same nodes can deadlock.
    author_node_creation = (
        "CALL apoc.periodic.iterate(\"CALL apoc.load.json('file:///"
        + file
        + "') YIELD value\",\"MERGE (w:Work {id: value.id}) \
        WITH w, value \
        UNWIND value.authorships AS authorship \
        MERGE (a:Author {id: authorship.author.id}) \
        SET a.display_name = coalesce(authorship.author.display_name, '') \
        MERGE (a)-[:WROTE]->(w)\",{ batchSize: 100, \
        parallel: false, retries: 2} ) \
        YIELD batches, total, operations"
    )

    # Uncomment the print command below to view the raw Cypher script used by Neo4j
    # print(author_node_creation)

    record, summary, keys = driver.execute_query(author_node_creation)
    print("Operations executed during file import - " + str(record[0][2]))
    print("File - " + file + " import complete")

print("All authors imported")

### LangGraph Implementation

In [63]:
from langchain_neo4j import GraphCypherQAChain, Neo4jGraph
from langchain_openai import ChatOpenAI
from langgraph.graph import START, END, StateGraph
from langchain_core.prompts import PromptTemplate


from typing import List, TypedDict
from pydantic import BaseModel
import re
import os

In [72]:
graph = Neo4jGraph(
    os.environ["NEO4J_URL"],
    os.environ["NEO4J_USERNAME"],
    os.environ["NEO4J_PASSWORD"]
)

In [73]:
llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    api_key=os.environ["OPENAI_API_KEY"],
)

In [74]:
CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}"""
CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

In [75]:
CYPHER_QA_TEMPLATE = """You are an assistant that helps to form nice and human understandable answers.
The information part contains the provided information that you must use to construct an answer.
The provided information is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
Make the answer sound as a response to the question. Do not mention that you based the result on the given information.
Here is an example:

Question: Which managers own Neo4j stocks?
Context:[manager:CTL LLC, manager:JANE STREET GROUP LLC]
Helpful Answer: CTL LLC, JANE STREET GROUP LLC owns Neo4j stocks.

Follow this example when generating answers.
If the provided information is empty, say that you don't know the answer.
Information:
{context}

Question: {question}
Helpful Answer:"""
CYPHER_QA_PROMPT = PromptTemplate(
    input_variables=["context", "question"], template=CYPHER_QA_TEMPLATE
)

In [80]:
chain = GraphCypherQAChain.from_llm(
    llm, graph=graph, verbose=True, qa_prompt=CYPHER_QA_PROMPT, cypher_prompt=CYPHER_GENERATION_PROMPT,
    allow_dangerous_requests = True
)

In [82]:
chain.run("Who is the author with the most papers?")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
cypher
MATCH (a:Author)-[:WROTE]->(w:Work)
RETURN a.display_name AS author, COUNT(w) AS num_papers
ORDER BY num_papers DESC
LIMIT 1

Full Context:
[{'author': 'Yoshua Bengio', 'num_papers': 75}]

> Finished chain.


'Yoshua Bengio is the author with the most papers.'
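
The generated text in the trace above begins with a `cypher` line, which looks like the language tag of a Markdown code fence. If you ever post-process LLM-generated Cypher yourself, a sketch of stripping the fence and the tag could look like this. `clean_cypher` is a hypothetical helper and makes no claim about what `GraphCypherQAChain` does internally.

```python
import re

# Matches a fenced block with an optional "cypher" language tag
FENCE_PATTERN = re.compile(r"```(?:cypher)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)


def clean_cypher(text):
    """Return the Cypher statement without Markdown fences or a leading 'cypher' tag."""
    match = FENCE_PATTERN.search(text)
    if match:
        text = match.group(1)
    text = text.strip()
    # Drop a bare "cypher" tag line left over from an already-stripped fence
    text = re.sub(r"^cypher\s*\n", "", text, flags=re.IGNORECASE)
    return text.strip()


print(clean_cypher("```cypher\nMATCH (a:Author) RETURN a.display_name LIMIT 1\n```"))
```

Text that contains neither a fence nor a tag passes through unchanged apart from surrounding whitespace.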