# Augmented Document

This *notebook* shows how to manually extract documents into a _Knowledge Graph_ using an LLM (Llama 3). The documents extracted here are the Wikipedia articles returned for the keyword **"Tim Cook"**, loaded with `WikipediaLoader` from LangChain.

Steps:
- Load the documents with `WikipediaLoader`.
- Preprocess the documents (remove _escape characters_).
- Split the documents into several _`chunks`_.
- (Optional) Summarize the `chunks` to get shorter text.
- Build a _prompt_ template for extracting the text into `entities` and `relations`.
- Run _inference_ on the "Llama 3" LLM to extract the text into entities and relations.
- Convert the resulting `entities` and `relations` into a `Cypher query`.
- Save the `Cypher query` to Neo4j as a `Knowledge Graph`.

## Preparation

- Create a GroqCloud account: https://console.groq.com/
- Create a new Neo4j instance: https://neo4j.com/docs/graph-data-science-client/current/getting-started/

## Install and import the required dependencies

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


In [None]:
# Install langchain, langchain_community, langchain-core, langchain-groq, and python-dotenv

!pip install langchain langchain_community langchain-core langchain-groq python-dotenv


Collecting langchain
  Downloading langchain-0.3.1-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain_community
  Downloading langchain_community-0.3.1-py3-none-any.whl.metadata (2.8 kB)
Collecting langchain-core
  Downloading langchain_core-0.3.7-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-groq
  Downloading langchain_groq-0.2.0-py3-none-any.whl.metadata (2.9 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_text_splitters-0.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.129-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 

In [None]:
!pip install youtube-transcript-api
!pip install wikipedia
!pip install beautifulsoup4
!pip install pypdf
!pip install neo4j

Collecting youtube-transcript-api
  Downloading youtube_transcript_api-0.6.2-py3-none-any.whl.metadata (15 kB)
Downloading youtube_transcript_api-0.6.2-py3-none-any.whl (24 kB)
Installing collected packages: youtube-transcript-api
Successfully installed youtube-transcript-api-0.6.2
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11679 sha256=a1b4c9310032ae3d20f6b707672cc2c92c9e0ed07a8e3f034654e8cdbebf4aa5
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
Collecting pypdf
  Downloading pypdf-5.0.1-py3-none-any.whl.metadata (7.4 kB)
Downloading pypdf-5.0.1-py3-none-an

In [None]:
import json
import os

import pandas as pd
from dotenv import load_dotenv
from langchain.chains import LLMChain
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import AIMessage, HumanMessage, SystemMessage
from langchain_community.chat_models import ChatOllama
from langchain_community.document_loaders import WikipediaLoader
from langchain_community.graphs import Neo4jGraph
from langchain_community.llms import Ollama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_groq import ChatGroq

# Load variables from a local .env file, if one exists (returns False when none is found)
load_dotenv()


For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet. 	from pydantic.v1 import BaseModel

  exec(code_obj, self.user_global_ns, self.user_ns)


False

In [None]:
from neo4j.debug import watch

watch("neo4j")

<neo4j.debug.Watcher at 0x7deb9c101510>

Connection to Neo4j.

Neo4j is used to store the Knowledge Graph.

In [None]:
import os
from neo4j import GraphDatabase

# Neo4j and Groq credentials.
# Read them from environment variables (for example Colab secrets or a .env file)
# instead of hard-coding secrets in the notebook.
NEO4J_URI = os.environ["NEO4J_URI"]  # e.g. 'neo4j+s://<instance-id>.databases.neo4j.io'
NEO4J_USERNAME = os.environ.get("NEO4J_USERNAME", "neo4j")
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]
# AURA_INSTANCEID = os.environ.get("AURA_INSTANCEID")
# AURA_INSTANCENAME = os.environ.get("AURA_INSTANCENAME")
GROQ_API_KEY = os.environ["GROQ_API_KEY"]

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

[DEBUG   ] [Thread 138451126087680] [Task 95546698208224 ] 2024-10-02 01:48:36,687  [#0000]  _: <POOL> created, routing address IPv4Address(('a208f8ae.databases.neo4j.io', 7687))
DEBUG:neo4j.pool:[#0000]  _: <POOL> created, routing address IPv4Address(('a208f8ae.databases.neo4j.io', 7687))


## Load Data with WikipediaLoader

The documents are sourced from Wikipedia with the keyword "Tim Cook", using `WikipediaLoader`.

In [None]:
query = "Tim Cook"
raw_documents = WikipediaLoader(query=query).load()
raw_documents





[Document(metadata={'title': 'Tim Cook', 'summary': "Timothy Donald Cook (born November 1, 1960) is an American business executive who is the current chief executive officer of Apple Inc. Cook had previously been the company's chief operating officer under its co-founder Steve Jobs. Cook joined Apple in March 1998 as a senior vice president for worldwide operations, and then as vice president for worldwide sales and operations. He was appointed chief executive on August 24, 2011, after Jobs, who had cancer and died later that year, resigned.\nDuring his tenure as the chief executive of Apple and while serving on its board of directors, he has advocated for the political reform of international and domestic surveillance, cybersecurity, national manufacturing, and environmental preservation. Since becoming CEO, Cook has also replaced Jobs's micromanagement with a more liberal style and implemented a collaborative culture at Apple.:\u200a314\u200a\nSince 2011 when he took over Apple, to 2

## Document Preprocessing

The documents returned by `WikipediaLoader` still contain escape characters and '==' heading markers. This step removes them.

In [None]:
filtered_raw_documents = [raw_documents[i] for i in [0,1,4,7,8,9,10,12,13]] #0: Tim Cook (person), 1: Apple (company), 4: Mac (product), 10: Research, 11: Apple Maps, 13: App Store, 7: Apple TV, 8: Steve Jobs, 13: iPhone
docs = " ".join([d.page_content for d in filtered_raw_documents]).replace("\n", "").replace("==", "")
print(docs)

Timothy Donald Cook (born November 1, 1960) is an American business executive who is the current chief executive officer of Apple Inc. Cook had previously been the company's chief operating officer under its co-founder Steve Jobs. Cook joined Apple in March 1998 as a senior vice president for worldwide operations, and then as vice president for worldwide sales and operations. He was appointed chief executive on August 24, 2011, after Jobs, who had cancer and died later that year, resigned.During his tenure as the chief executive of Apple and while serving on its board of directors, he has advocated for the political reform of international and domestic surveillance, cybersecurity, national manufacturing, and environmental preservation. Since becoming CEO, Cook has also replaced Jobs's micromanagement with a more liberal style and implemented a collaborative culture at Apple.: 314 Since 2011 when he took over Apple, to 2020, Cook doubled the company's revenue and profit, and the company

In [None]:
filtered_raw_documents

[Document(metadata={'title': 'Tim Cook', 'summary': "Timothy Donald Cook (born November 1, 1960) is an American business executive who is the current chief executive officer of Apple Inc. Cook had previously been the company's chief operating officer under its co-founder Steve Jobs. Cook joined Apple in March 1998 as a senior vice president for worldwide operations, and then as vice president for worldwide sales and operations. He was appointed chief executive on August 24, 2011, after Jobs, who had cancer and died later that year, resigned.\nDuring his tenure as the chief executive of Apple and while serving on its board of directors, he has advocated for the political reform of international and domestic surveillance, cybersecurity, national manufacturing, and environmental preservation. Since becoming CEO, Cook has also replaced Jobs's micromanagement with a more liberal style and implemented a collaborative culture at Apple.:\u200a314\u200a\nSince 2011 when he took over Apple, to 2
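
The cleaning cell above removes newlines without inserting a space, which is why sentences run together in the printed text (for example "resigned.During"). A minimal alternative sketch, assuming it is acceptable to replace each newline and heading marker with a single space; the helper name `clean_wiki_text` is hypothetical and not part of the original notebook:

```python
import re

def clean_wiki_text(text: str) -> str:
    """Remove Wikipedia heading markers and normalise whitespace."""
    # Drop heading markers such as "== History ==" or "=== Early life ===".
    text = re.sub(r"={2,}", " ", text)
    # Replace hair spaces and zero-width spaces with a normal space.
    text = re.sub(r"[\u200a\u200b]", " ", text)
    # Collapse newlines, tabs, and repeated spaces into one space.
    return re.sub(r"\s+", " ", text).strip()

sample = "He resigned.\nDuring his tenure\n\n== Early life ==\nCook was born in Mobile."
print(clean_wiki_text(sample))
```

With this variant the sentence boundary after "resigned." is kept, which may matter later because the inference step splits the text on `'. '`.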

## Chunking The Document

The cleaned document is then split into smaller parts called `chunks`.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=30
)
split_docs = text_splitter.create_documents([docs])
split_docs

[Document(metadata={}, page_content="Timothy Donald Cook (born November 1, 1960) is an American business executive who is the current chief executive officer of Apple Inc. Cook had previously been the company's chief operating officer under its co-founder Steve Jobs. Cook joined Apple in March 1998 as a senior vice president for worldwide operations, and then as vice president for worldwide sales and operations. He was appointed chief executive on August 24, 2011, after Jobs, who had cancer and died later that year, resigned.During his tenure as the chief executive of Apple and while serving on its board of directors, he has advocated for the political reform of international and domestic surveillance, cybersecurity, national manufacturing, and environmental preservation. Since becoming CEO, Cook has also replaced Jobs's micromanagement with a more liberal style and implemented a collaborative culture at Apple.:\u200a314\u200aSince 2011 when he took over Apple, to 2020, Cook doubled th
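
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which splits recursively on separators and measures length in tiktoken tokens; this version assumes whitespace-separated words as tokens, and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=500, chunk_overlap=30):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window repeats the last token of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk.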

## Summary Text

Summarize the text with the Mixtral model to make it shorter. If you prefer to use the full text, you can skip this code.

In [None]:
from langchain_groq import ChatGroq


os.environ["GROQ_API_KEY"] = GROQ_API_KEY  # reuse the key defined in the connection cell; an empty key makes ChatGroq fail

In [None]:
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter


llm = ChatGroq(temperature=0, model_name="mixtral-8x7b-32768")  # Mixtral model served by Groq

# Define the map prompt template
map_template = """The following is a set of documents
{all_data}
Based on this list of docs, please find the important information from it (focus on entities and relationship)
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)

# Define the map_chain
map_chain = LLMChain(llm=llm, prompt=map_prompt)

reduce_template = """The following is set of summaries:
{all_data}
Take these and distill it into a final, consolidated summary of the main themes. In one final paragraph
Helpful Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain,
    document_variable_name="all_data"  # This should match the variable name in reduce_prompt
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # If documents exceed context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=1024,
)

# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="all_data",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)


# Run the MapReduce Chain
summarization_results = map_reduce_chain.run(split_docs)

Token indices sequence length is longer than the specified maximum sequence length for this model (5857 > 1024). Running this sequence through the model will result in indexing errors


In [None]:
import textwrap
print(textwrap.fill(summarization_results, 100))

The text primarily revolves around three main themes: prominent figures in technology and military
history, the development and impact of Apple's App Store, and the evolution of Apple's iPhone.
Firstly, it discusses Tim Cook's significant contributions to Canadian military history through his
authorship of thirteen books, earning prestigious awards. In contrast, Tim Higgins' book about
Tesla, Inc. and its CEO, Elon Musk, has received mixed reviews. The text also highlights the
achievements of Apple co-founder Steven Paul Jobs, who played a pivotal role in technology and
innovation, and posthumously received the Presidential Medal of Freedom in 2022. Secondly, the text
explores Apple's App Store, which has faced criticism for its monopolistic nature despite developers
earning over $155 billion. Apple's legal case against Amazon over the use of the term "App Store"
and the company's efforts to improve its mapping services, such as Apple Maps, are also discussed.
Lastly, the text covers t
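
Conceptually, the map-reduce chain above summarizes every chunk, then merges the partial summaries in batches until one remains. The sketch below shows that control flow only, with a stand-in `summarize` callable in place of the LLM and a character budget (`max_chars`) in place of the chain's `token_max`; both names are assumptions for illustration.

```python
from typing import Callable, List

def map_reduce_summary(chunks: List[str], summarize: Callable[[str], str],
                       max_chars: int = 200) -> str:
    """Summarise each chunk, then repeatedly merge summaries until one remains."""
    # Map step: one summary per chunk.
    summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: pack summaries into batches under max_chars and summarise each batch.
    while len(summaries) > 1:
        batches, current = [], ""
        for s in summaries:
            if current and len(current) + len(s) + 1 > max_chars:
                batches.append(current)
                current = s
            else:
                current = f"{current}\n{s}" if current else s
        batches.append(current)
        reduced = [summarize(batch) for batch in batches]
        if len(reduced) == len(summaries):
            # Nothing could be merged: summarise everything in one call to guarantee progress.
            reduced = [summarize("\n".join(summaries))]
        summaries = reduced
    return summaries[0] if summaries else ""

# Stand-in summariser: keep only the text before the first full stop.
first_sentence = lambda text: text.split(".")[0].strip()
print(map_reduce_summary(["Tim Cook leads Apple. He joined in 1998.",
                          "Apple makes the iPhone. It launched in 2007."],
                         first_sentence))
```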

In [None]:
file_path = "/content/drive/MyDrive/Dokument-Graph-RAG/raw_data/summary.txt"

with open(file_path, 'a') as file:
    file.write(summarization_results)

## Creating Prompt Template for Extracting Text

In [None]:
# Confirm that the key is loaded without echoing the secret itself
print("GROQ_API_KEY is set:", bool(GROQ_API_KEY))

GROQ_API_KEY is set: True


In [None]:
from langchain_groq import ChatGroq


os.environ["GROQ_API_KEY"] = GROQ_API_KEY  # reuse the key defined in the connection cell; never hard-code secrets

For now, `entity_types` and `relation_types` are defined manually. A more efficient approach would probably be to let a Large Language Model (LLM) propose them.

In [None]:
entity_types = ['person','school','award','company','product','characteristic']
relation_types = ['alumniOf','worksFor','hasAward','isProducedBy','hasCharacteristic','acquired','hasProject','isFounderOf']

system_prompt = PromptTemplate(
    template = """
    You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
    Your task is to identify the entities and relations requested with the user prompt, from a given text.
    You must generate the output in a JSON containing a list with JSON objects having the following keys: "head", "head_type", "relation", "tail", and "tail_type".
    The "head" key must contain the text of the extracted entity with one of the types from the provided list in the user prompt.
    The "head_type" key must contain the type of the extracted head entity which must be one of the types from {entity_types}.
    The "relation" key must contain the type of relation between the "head" and the "tail" which must be one of the relations from {relation_types}.
    The "tail" key must represent the text of an extracted entity which is the tail of the relation, and the "tail_type" key must contain the type of the tail entity from {entity_types}.
    Attempt to extract as many entities and relations as you can.

    IMPORTANT NOTES:
    - Don't add any explanation and text.
    """,
    input_variables=["entity_types","relation_types"],
)


system_message_prompt = SystemMessagePromptTemplate(prompt = system_prompt)

examples = [
        {
            "text":"Adam is a software engineer in Microsoft since 2009, and last year he got an award as the Best Talent" ,
            "head": "Adam",
            "head_type": "person",
            "relation": "worksFor",
            "tail": "Microsoft",
            "tail_type": "company"
        },
        {
            "text":"Adam is a software engineer in Microsoft since 2009, and last year he got an award as the Best Talent" ,
            "head": "Adam",
            "head_type": "person",
            "relation": "hasAward",
            "tail": "Best Talent",
            "tail_type": "award"
        },
        {
            "text":"Microsoft is a tech company that provides several products such as Microsoft Word" ,
            "head": "Microsoft Word",
            "head_type": "product",
            "relation": "isProducedBy",
            "tail": "Microsoft",
            "tail_type": "company"
        },
        {
            "text":"Microsoft Word is a lightweight app that is accessible offline" ,
            "head": "Microsoft Word",
            "head_type": "product",
            "relation": "hasCharacteristic",
            "tail": "lightweight app",
            "tail_type": "characteristic"
        },
        {
            "text":"Microsoft Word is a lightweight app that is accessible offline" ,
            "head": "Microsoft Word",
            "head_type": "product",
            "relation": "hasCharacteristic",
            "tail": "accessible offline",
            "tail_type": "characteristic"
        },
    ]

class ExtractedInfo(BaseModel):
    head: str = Field(description="extracted first or head entity like Microsoft, Apple, John")
    head_type: str = Field(description="type of the extracted head entity like person, company, etc")
    relation: str = Field(description="relation between the head and the tail entities")
    tail: str = Field(description="extracted second or tail entity like Microsoft, Apple, John")
    tail_type: str = Field(description="type of the extracted tail entity like person, company, etc")

parser = JsonOutputParser(pydantic_object=ExtractedInfo)

human_prompt = PromptTemplate(
    template = """ Based on the following example, extract entities and relations from the provided text.\n\n

    Use the following entity types, don't use other entity that is not defined below:
    # ENTITY TYPES:
    {entity_types}

    Use the following relation types, don't use other relation that is not defined below:
    # RELATION TYPES:
    {relation_types}

    Below are a number of examples of text and their extracted entities and relationships.
    {examples}

    For the following text, extract entities and relations as in the provided example.\n{format_instructions}\nText: {text}""",
    input_variables=["entity_types","relation_types","examples","text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

human_message_prompt = HumanMessagePromptTemplate(prompt=human_prompt)

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])


# model = ChatOllama(model = "mistral",temperature=0)
# model = ChatOllama(model = "llama3",temperature=0)
model_name = "llama3-70b-8192"
model = ChatGroq(temperature=0, model_name=model_name)
chain = LLMChain(llm=model, prompt=chat_prompt)

In [None]:
parser.get_format_instructions()

'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{"properties": {"head": {"title": "Head", "description": "extracted first or head entity like Microsoft, Apple, John", "type": "string"}, "head_type": {"title": "Head Type", "description": "type of the extracted head entity like person, company, etc", "type": "string"}, "relation": {"title": "Relation", "description": "relation between the head and the tail entities", "type": "string"}, "tail": {"title": "Tail", "description": "extracted second or tail entity like Microsoft, Apple, John", "type": "string"}, "tail_type": {"title": "Tail

## Running Inference on the LLM

In [None]:
file_path = "/content/drive/MyDrive/Dokument-Graph-RAG/clean_summary.txt"
with open(file_path, 'r') as file:
    # Read the entire file contents into a string
    file_contents = file.read()

# Split the file contents into sentences
sentences = file_contents.split('. ')

result = []
# Iterate over each sentence
for sentence in sentences:
    # Process each sentence
    response  = chain.run(entity_types = entity_types, relation_types = relation_types, examples = examples, text = sentence)
    print(response)
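    try:
        # Parse the response as JSON instead of eval(), which would execute arbitrary model output
        result.extend(json.loads(response))
    except (json.JSONDecodeError, TypeError):
        # Skip responses that are not a valid JSON list
        pass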

[
    {
        "head": "Tim Cook",
        "head_type": "person",
        "relation": "worksFor",
        "tail": "Apple",
        "tail_type": "company"
    }
]
[
    {
        "head": "He",
        "head_type": "person",
        "relation": "worksFor",
        "tail": "Apple",
        "tail_type": "company"
    },
    {
        "head": "iPod Nano",
        "head_type": "product",
        "relation": "isProducedBy",
        "tail": "Apple",
        "tail_type": "company"
    },
    {
        "head": "iPhone",
        "head_type": "product",
        "relation": "isProducedBy",
        "tail": "Apple",
        "tail_type": "company"
    },
    {
        "head": "iPad",
        "head_type": "product",
        "relation": "isProducedBy",
        "tail": "Apple",
        "tail_type": "company"
    }
]
[
    {"head": "Cook", "head_type": "person", "relation": "alumniOf", "tail": "Auburn University", "tail_type": "school"},
    {"head": "Cook", "head_type": "person", "relation": "alumniOf",

In [None]:
result

[{'head': 'Tim Cook',
  'head_type': 'person',
  'relation': 'worksFor',
  'tail': 'Apple',
  'tail_type': 'company'},
 {'head': 'He',
  'head_type': 'person',
  'relation': 'worksFor',
  'tail': 'Apple',
  'tail_type': 'company'},
 {'head': 'iPod Nano',
  'head_type': 'product',
  'relation': 'isProducedBy',
  'tail': 'Apple',
  'tail_type': 'company'},
 {'head': 'iPhone',
  'head_type': 'product',
  'relation': 'isProducedBy',
  'tail': 'Apple',
  'tail_type': 'company'},
 {'head': 'iPad',
  'head_type': 'product',
  'relation': 'isProducedBy',
  'tail': 'Apple',
  'tail_type': 'company'},
 {'head': 'Cook',
  'head_type': 'person',
  'relation': 'alumniOf',
  'tail': 'Auburn University',
  'tail_type': 'school'},
 {'head': 'Cook',
  'head_type': 'person',
  'relation': 'alumniOf',
  'tail': 'Duke University',
  'tail_type': 'school'},
 {'head': 'Tim Cook',
  'head_type': 'person',
  'relation': 'hasAward',
  'tail': 'Financial Times Person of the Year',
  'tail_type': 'award'},
 {'he
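
The raw result contains a pronoun head ("He"), and the model can also return types that are not in `entity_types` or `relation_types`. One possible post-filter is sketched below. The two type sets are copied from the prompt cell; the `PRONOUNS` set and the function name `is_valid_triple` are assumptions, and whether to drop such triples or remap them by hand is a judgment call.

```python
ENTITY_TYPES = {"person", "school", "award", "company", "product", "characteristic"}
RELATION_TYPES = {"alumniOf", "worksFor", "hasAward", "isProducedBy",
                  "hasCharacteristic", "acquired", "hasProject", "isFounderOf"}
PRONOUNS = {"he", "she", "it", "they"}  # assumption: a small illustrative stop list

def is_valid_triple(triple: dict) -> bool:
    """Check that a triple uses only allowed types and names real entities."""
    return (
        triple["head_type"] in ENTITY_TYPES
        and triple["tail_type"] in ENTITY_TYPES
        and triple["relation"] in RELATION_TYPES
        and triple["head"].lower() not in PRONOUNS
        and triple["tail"].lower() not in PRONOUNS
    )

triples = [
    {"head": "Tim Cook", "head_type": "person", "relation": "worksFor",
     "tail": "Apple", "tail_type": "company"},
    {"head": "He", "head_type": "person", "relation": "worksFor",
     "tail": "Apple", "tail_type": "company"},
    {"head": "Tim Cook", "head_type": "person", "relation": "led",
     "tail": "inventory reduction measures", "tail_type": "characteristic"},
]
valid = [t for t in triples if is_valid_triple(t)]
print(valid)
```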

## Convert to Cypher Query

In [None]:
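import ast

with open("/content/drive/MyDrive/Dokument-Graph-RAG/clean_result.txt", "r") as file:
    content = file.read()
# literal_eval only accepts Python literals, so it cannot run code stored in the file
entity_relations = ast.literal_eval(content)
print(entity_relations)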

[{'head': 'Tim Cook', 'head_type': 'person', 'relation': 'worksFor', 'tail': 'Apple', 'tail_type': 'company'}, {'head': 'Tim Cook', 'head_type': 'person', 'relation': 'led', 'tail': 'inventory reduction measures', 'tail_type': 'characteristic'}, {'head': 'Tim Cook', 'head_type': 'person', 'relation': 'led', 'tail': 'long-term investments in flash memory', 'tail_type': 'characteristic'}, {'head': 'iPod Nano', 'head_type': 'product', 'relation': 'isProducedBy', 'tail': 'Apple', 'tail_type': 'company'}, {'head': 'iPhone', 'head_type': 'product', 'relation': 'isProducedBy', 'tail': 'Apple', 'tail_type': 'company'}, {'head': 'iPad', 'head_type': 'product', 'relation': 'isProducedBy', 'tail': 'Apple', 'tail_type': 'company'}, {'head': 'iPod Nano', 'head_type': 'product', 'relation': 'isProducedBy', 'tail': 'Apple', 'tail_type': 'company'}, {'head': 'Tim Cook', 'head_type': 'person', 'relation': 'alumniOf', 'tail': 'Auburn University', 'tail_type': 'school'}, {'head': 'Tim Cook', 'head_type':

In [None]:
df = pd.DataFrame(entity_relations)
df

|   | head | head_type | relation | tail | tail_type |
|---|---|---|---|---|---|
| 0 | Tim Cook | person | worksFor | Apple | company |
| 1 | Tim Cook | person | led | inventory reduction measures | characteristic |
| 2 | Tim Cook | person | led | long-term investments in flash memory | characteristic |
| 3 | iPod Nano | product | isProducedBy | Apple | company |
| 4 | iPhone | product | isProducedBy | Apple | company |
| 5 | iPad | product | isProducedBy | Apple | company |
| 6 | iPod Nano | product | isProducedBy | Apple | company |
| 7 | Tim Cook | person | alumniOf | Auburn University | school |
| 8 | Tim Cook | person | alumniOf | Duke University | school |
| 9 | Tim Cook | person | hasAward | Financial Times Person of the Year | award |


In [None]:
unique_entities = set()
for item in entity_relations:
    unique_entities.add((item['head'], item['head_type']))
    unique_entities.add((item['tail'], item['tail_type']))

unique_entities_list = list(unique_entities)
print(unique_entities_list)

[("Fortune's World's Greatest Leader", 'award'), ('Apple', 'company'), ('Project Titan', 'project'), ('App Store', 'product'), ('Steve Jobs', 'person'), ('iPhone', 'product'), ('technology', 'characteristic'), ('Duke University', 'school'), ('electric and self-driving car technology', 'characteristic'), ('Apple I', 'product'), ('Apple II', 'product'), ('Apple Maps', 'product'), ('trailblazing technology company', 'characteristic'), ('inventory reduction measures', 'characteristic'), ('iPad', 'product'), ('Financial Times Person of the Year', 'award'), ('multi-touch technology', 'characteristic'), ('Touch ID', 'characteristic'), ('Lisa', 'product'), ('iPod Nano', 'product'), ('long-term investments in flash memory', 'characteristic'), ('Cingular', 'company'), ('Placebase', 'company'), ('Ripple of Change Award', 'award'), ('NeXT', 'company'), ('Tim Cook', 'person'), ('Steve Wozniak', 'person'), ('Auburn University', 'school'), ('graphical user interface-based system', 'characteristic'), 

In [None]:
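# Write one MERGE statement per unique (entity name, entity type) pair
with open("cypher_query.txt", "a") as file:
    for name, entity_type in unique_entities_list:
        # Derive the Cypher variable name from the entity name
        variable = name.replace(" ","_").replace("-","").replace("'","").lower()
        merge_statement = f"""MERGE ({variable}:{entity_type} {{id: "{name}"}})\n"""
        file.write(merge_statement)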

In [None]:
# Build the Cypher statements for the relationships
with open("cypher_query.txt", "a") as file:
    for item in entity_relations:
        head = item['head'].replace(" ","_").replace("-","").replace("'","").lower()
        tail = item['tail'].replace(" ","_").replace("-","").replace("'","").lower()
        cypher = f"""MERGE ({head})-[:{item['relation']}]->({tail})\n"""
        file.write(cypher)
        print(cypher)

MERGE (tim_cook)-[:worksFor]->(apple)

MERGE (tim_cook)-[:led]->(inventory_reduction_measures)

MERGE (tim_cook)-[:led]->(longterm_investments_in_flash_memory)

MERGE (ipod_nano)-[:isProducedBy]->(apple)

MERGE (iphone)-[:isProducedBy]->(apple)

MERGE (ipad)-[:isProducedBy]->(apple)

MERGE (ipod_nano)-[:isProducedBy]->(apple)

MERGE (tim_cook)-[:alumniOf]->(auburn_university)

MERGE (tim_cook)-[:alumniOf]->(duke_university)

MERGE (tim_cook)-[:hasAward]->(financial_times_person_of_the_year)

MERGE (tim_cook)-[:hasAward]->(ripple_of_change_award)

MERGE (tim_cook)-[:hasAward]->(fortunes_worlds_greatest_leader)

MERGE (apple)-[:isFoundedBy]->(steve_wozniak)

MERGE (apple)-[:isFoundedBy]->(steve_jobs)

MERGE (apple_i)-[:isProducedBy]->(apple)

MERGE (apple_ii)-[:isProducedBy]->(apple)

MERGE (lisa)-[:isProducedBy]->(apple)

MERGE (macintosh)-[:isProducedBy]->(apple)

MERGE (steve_jobs)-[:worksFor]->(apple)

MERGE (steve_jobs)-[:left]->(apple)

MERGE (apple)-[:acquired]->(next)

MERGE (nex
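
The variable names above are built by removing spaces, hyphens, and apostrophes only, so an entity name containing other punctuation (a comma, a full stop) or starting with a digit would yield an invalid Cypher variable. A stricter sketch follows; the helper names are hypothetical, and the variable names it produces differ slightly from the ones printed above (for example `long_term` instead of `longterm`).

```python
import re

def to_variable(name: str) -> str:
    """Turn an entity name into a safe Cypher variable name."""
    # Replace every run of non-word characters with a single underscore.
    variable = re.sub(r"\W+", "_", name.lower()).strip("_")
    # Cypher variables cannot start with a digit, so add a prefix when needed.
    return f"n_{variable}" if not variable or variable[0].isdigit() else variable

def node_statement(name: str, entity_type: str) -> str:
    """Build the MERGE statement for one node, escaping quotes in the id."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'MERGE ({to_variable(name)}:{entity_type} {{id: "{escaped}"}})'

def relation_statement(head: str, relation: str, tail: str) -> str:
    """Build the MERGE statement for one relationship between two bound variables."""
    return f"MERGE ({to_variable(head)})-[:{relation}]->({to_variable(tail)})"

print(node_statement("Tim Cook", "person"))
print(node_statement("Tesla, Inc.", "company"))
print(relation_statement("Tim Cook", "worksFor", "Apple"))
```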

## Save the Cypher Query to Neo4j as a Knowledge Graph

In [None]:
!pip install graphdatascience

Collecting graphdatascience
  Downloading graphdatascience-1.11-py3-none-any.whl.metadata (7.4 kB)
Collecting multimethod<2.0,>=1.0 (from graphdatascience)
  Downloading multimethod-1.12-py3-none-any.whl.metadata (9.6 kB)
Collecting textdistance<5.0,>=4.0 (from graphdatascience)
  Downloading textdistance-4.6.3-py3-none-any.whl.metadata (18 kB)
Downloading graphdatascience-1.11-py3-none-any.whl (1.6 MB)
Downloading multimethod-1.12-py3-none-any.whl (10 kB)
Downloading textdistance-4.6.3-py3-none-any.whl (31 kB)
Installing collected packages: textdistance, multimethod, graphdatascience
Successfully installed graphdatascience-1.11 multimethod-1.12 textdistance-4.6.3


In [None]:
# SETUP ENV and NEO4J
# Reuse the NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD values defined in the connection cell above.

# https://api.python.langchain.com/en/latest/graphs/langchain_community.graphs.neo4j_graph.Neo4jGraph.html
graph = Neo4jGraph(url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD)


[DEBUG   ] [Thread 138451126087680] [Task 95546698208224 ] 2024-10-02 02:20:40,413  [#0000]  _: <POOL> created, routing address IPv4Address(('a208f8ae.databases.neo4j.io', 7687))
DEBUG:neo4j.pool:[#0000]  _: <POOL> created, routing address IPv4Address(('a208f8ae.databases.neo4j.io', 7687))
[DEBUG   ] [Thread 138451126087680] [Task 95546698208224 ] 2024-10-02 02:20:40,420  [#0000]  _: <WORKSPACE> resolve home database
DEBUG:neo4j:[#0000]  _: <WORKSPACE> resolve home database
[DEBUG   ] [Thread 138451126087680] [Task 95546698208224 ] 2024-10-02 02:20:40,425  [#0000]  _: <POOL> attempting to update routing table from IPv4Address(('a208f8ae.databases.neo4j.io', 7687))
DEBUG:neo4j.pool:[#0000]  _: <POOL> attempting to update routing table from IPv4Address(('a208f8ae.databases.neo4j.io', 7687))
[DEBUG   ] [Thread 138451126087680] [Task 95546698208224 ] 2024-10-02 02:20:40,430  [#0000]  _: <RESOLVE> in: a208f8ae.databases.neo4j.io:7687
DEBUG:neo4j.io:[#0000]  _: <RESOLVE> in: a208f8ae.databas
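
The connection can only succeed when all three Neo4j settings are non-empty. A minimal sketch of an early check follows, assuming the settings are exported as environment variables; `load_neo4j_config` and `REQUIRED_KEYS` are hypothetical names that are not part of the notebook, and the values in `example_env` are placeholders.

```python
import os
from typing import Mapping, Optional

# Hypothetical helper (not part of the notebook): collects the three Neo4j
# settings from a mapping such as os.environ and fails early if any is empty.
REQUIRED_KEYS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def load_neo4j_config(env: Optional[Mapping[str, str]] = None) -> dict:
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError(f"Missing Neo4j settings: {', '.join(missing)}")
    return {
        "url": env["NEO4J_URI"],
        "username": env["NEO4J_USERNAME"],
        "password": env["NEO4J_PASSWORD"],
    }


# Placeholder values for illustration; real credentials would come from os.environ.
example_env = {
    "NEO4J_URI": "neo4j+s://example.databases.neo4j.io",
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "secret",
}
config = load_neo4j_config(example_env)
print(sorted(config))
# graph = Neo4jGraph(**config)  # would connect using the validated settings
```

Failing with a clear message before the driver is created tends to be easier to debug than a connection error raised deep inside the pool.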

In [None]:
graph.refresh_schema()
print(graph.schema)

Node properties:
person {id: STRING}
product {id: STRING}
award {id: STRING}
characteristic {id: STRING}
company {id: STRING}
school {id: STRING}
project {id: STRING}
Relationship properties:

The relationships:
(:person)-[:worksFor]->(:company)
(:person)-[:left]->(:company)
(:person)-[:led]->(:characteristic)
(:person)-[:hasAward]->(:award)
(:person)-[:alumniOf]->(:school)
(:product)-[:isProducedBy]->(:company)
(:product)-[:hasCharacteristic]->(:characteristic)
(:company)-[:isFoundedBy]->(:person)
(:company)-[:acquired]->(:company)
(:company)-[:hasCharacteristic]->(:characteristic)
(:company)-[:collaboratedWith]->(:company)
(:company)-[:operates]->(:product)
(:company)-[:hasProject]->(:product)
(:company)-[:hasProject]->(:project)
(:company)-[:hasTechnology]->(:characteristic)
(:project)-[:hasCharacteristic]->(:characteristic)
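
The relationship lines in the schema follow a regular `(:source)-[:type]->(:target)` pattern, so they can be turned into tuples for further inspection. The sketch below is one possible way to do that with a regular expression; `parse_relationships` is a hypothetical helper, and `schema_text` holds only a few lines copied from the printout above rather than the live `graph.schema` value.

```python
import re
from collections import defaultdict

# A few relationship lines copied from the schema printout.
schema_text = """
(:person)-[:worksFor]->(:company)
(:person)-[:alumniOf]->(:school)
(:product)-[:isProducedBy]->(:company)
(:company)-[:hasProject]->(:project)
"""

PATTERN = re.compile(r"\(:(\w+)\)-\[:(\w+)\]->\(:(\w+)\)")


def parse_relationships(text):
    """Return (source_label, relationship_type, target_label) tuples."""
    return PATTERN.findall(text)


triples = parse_relationships(schema_text)

# Group the outgoing relationship types by source label.
outgoing = defaultdict(list)
for source, rel, target in triples:
    outgoing[source].append((rel, target))

print(outgoing["person"])
# [('worksFor', 'company'), ('alumniOf', 'school')]
```

In the notebook the same function could be applied to `graph.schema` directly, assuming the printout keeps this format.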


In [None]:
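# Load the Cypher text generated in the previous step from Google Drive
with open("/content/drive/MyDrive/Dokument-Graph-RAG/cypher_query.txt", "r") as file:
    queries = file.read()

# Run it against Neo4j; a query without a RETURN clause yields an empty list
graph.query(queries)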

[]
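
The cell above sends the whole file to Neo4j as a single query. If the file instead held several semicolon-terminated statements, which a single query call may reject, one option would be to run them one at a time. The sketch below illustrates that idea under stated assumptions: `split_statements`, `run_statements`, and `RecordingGraph` are hypothetical names, the stand-in class replaces the real `Neo4jGraph` so the example runs without a database, and `sample` is illustrative text, not the content of `cypher_query.txt`.

```python
def split_statements(text):
    """Split Cypher text on semicolons, dropping empty pieces.

    Naive on purpose: a semicolon inside a string literal would be
    split as well, so this only suits simple generated statements.
    """
    return [part.strip() for part in text.split(";") if part.strip()]


def run_statements(graph, text):
    """Run each statement separately and return how many were executed."""
    statements = split_statements(text)
    for statement in statements:
        graph.query(statement)
    return len(statements)


class RecordingGraph:
    """Stand-in for Neo4jGraph that only records the queries it receives."""

    def __init__(self):
        self.seen = []

    def query(self, statement):
        self.seen.append(statement)
        return []


# Illustrative statements only; the real file content is not shown in the notebook.
sample = 'MERGE (p:person {id: "Tim Cook"});\nMERGE (c:company {id: "Apple"});\n'
fake = RecordingGraph()
count = run_statements(fake, sample)
print(count)
```

Running statements individually also makes it easier to see which one failed, at the cost of one round trip per statement.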

In [None]:
graph.refresh_schema()
print(graph.schema)

Node properties:
person {id: STRING}
product {id: STRING}
award {id: STRING}
characteristic {id: STRING}
company {id: STRING}
school {id: STRING}
project {id: STRING}
Relationship properties:

The relationships:
(:person)-[:worksFor]->(:company)
(:person)-[:left]->(:company)
(:person)-[:led]->(:characteristic)
(:person)-[:hasAward]->(:award)
(:person)-[:alumniOf]->(:school)
(:product)-[:isProducedBy]->(:company)
(:product)-[:hasCharacteristic]->(:characteristic)
(:company)-[:isFoundedBy]->(:person)
(:company)-[:acquired]->(:company)
(:company)-[:hasCharacteristic]->(:characteristic)
(:company)-[:collaboratedWith]->(:company)
(:company)-[:operates]->(:product)
(:company)-[:hasProject]->(:product)
(:company)-[:hasProject]->(:project)
(:company)-[:hasTechnology]->(:characteristic)
(:project)-[:hasCharacteristic]->(:characteristic)
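
The schema printed after the load lists the same relationships as the one printed before it. Comparing two printouts by eye is error-prone, so a small set difference can help. This is a sketch only: `relationship_lines` is a hypothetical helper, and the `before` and `after` strings are shortened illustrative samples, not the notebook's actual output.

```python
def relationship_lines(schema):
    """Collect the '(:a)-[:r]->(:b)' lines from a schema printout."""
    return {
        line.strip()
        for line in schema.splitlines()
        if line.strip().startswith("(:")
    }


# Shortened sample printouts; in the notebook these would be two
# snapshots of graph.schema taken before and after graph.query(...).
before = """The relationships:
(:person)-[:worksFor]->(:company)
(:company)-[:hasProject]->(:project)
"""
after = """The relationships:
(:person)-[:worksFor]->(:company)
(:company)-[:hasProject]->(:project)
(:person)-[:alumniOf]->(:school)
"""

added = relationship_lines(after) - relationship_lines(before)
removed = relationship_lines(before) - relationship_lines(after)
print(sorted(added), sorted(removed))
```

An empty `added` set after a load would indicate that the statements introduced no new relationship patterns, although it says nothing about how many nodes or edges were written.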
