## **1. From vectors to graphs**

The RAG architectures we've explored involve embedding a user input and querying a vector store to return relevant documents based on their semantic similarity. Although powerful, this approach does have some limitations. 

Firstly, document embedding captures semantic meaning but struggles to capture themes and relationships between entities in the document corpus.

Moreover, as the volume of the database grows, the retrieval process can become less efficient, as the computational load increases with the search space.

Lastly, vector RAG systems don't easily accommodate structured or diverse data, which are harder to embed.

<img src='./images/vector-rag-limitations.png' width=50%>
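To make the retrieval step described above concrete, here is a minimal sketch of similarity search over a tiny in-memory store. The three-dimensional vectors and document names are made up for illustration (they do not come from a real embedding model), and the brute-force scan is an assumption chosen for clarity: real vector stores typically use approximate indexes instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": hypothetical values, not produced by a real model.
store = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}


def retrieve(query_vector, store, k=2):
    """Score every stored vector against the query and keep the top k."""
    scored = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


query = [1.0, 0.2, 0.0]
print(retrieve(query, store))  # ['doc_cats', 'doc_dogs']
```

In this naive form, each query is compared against every stored vector, so the work grows with the size of the store. The ranking also only reflects how close two vectors are: nothing in it records how the entities inside the documents relate to one another.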

### **Graph databases**

We can address all of those challenges with graphs. Graphs are great at representing and storing diverse and interconnected information in a structured manner. 

Entities, like people, places, and sports teams, are represented by nodes, and the relationships between them are represented by labeled edges.

Notice that edges are directional, so relationships can apply from one entity to another, but not necessarily the other way around. We'll look at this more closely in a later video.

<img src='./images/graphdb.png' width=50%>
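A small sketch can show why edge direction matters. It stores each edge as a `(source, relationship, target)` triple in plain Python. The entity and relationship names (`Alice`, `Paris`, `Paris FC`, `LIVES_IN`, and so on) are hypothetical examples, not data from this notebook.

```python
# Each edge is a (source, relationship, target) triple; the order encodes direction.
edges = {
    ("Alice", "LIVES_IN", "Paris"),
    ("Alice", "SUPPORTS", "Paris FC"),
    ("Paris FC", "LOCATED_IN", "Paris"),
}


def has_edge(source, relationship, target):
    """True only if the edge exists in exactly this direction."""
    return (source, relationship, target) in edges


def outgoing(source):
    """All (relationship, target) pairs leaving a node, sorted for stable output."""
    return sorted((rel, tgt) for src, rel, tgt in edges if src == source)


print(has_edge("Alice", "LIVES_IN", "Paris"))  # True
print(has_edge("Paris", "LIVES_IN", "Alice"))  # False: the reverse edge was never stored
print(outgoing("Alice"))  # [('LIVES_IN', 'Paris'), ('SUPPORTS', 'Paris FC')]
```

Swapping source and target produces a different edge, which is what lets a graph say that Alice lives in Paris without implying the reverse.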

### **Neo4j graph databases**

Neo4j is a powerful graph database option designed to store and efficiently query complex relationships.

Our entities are represented as nodes, where the color indicates the entity type, such as a person. The relationships are represented by edges with types like LOCATED and INTERESTED. 

Each node also has a type and unique identifier. Nodes can contain any number of properties represented as key:value pairs.

<img src='./images/graph-neo4j.png' width=50%>
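The node and edge structure described above can be sketched with two small dataclasses. These are plain Python stand-ins written for illustration, not Neo4j's or LangChain's own classes, and the `City` type and the property values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str    # unique identifier
    type: str  # entity type, e.g. "Person"
    properties: dict = field(default_factory=dict)  # arbitrary key:value pairs


@dataclass
class Relationship:
    source: Node
    target: Node
    type: str  # relationship type, e.g. "LOCATED"
    properties: dict = field(default_factory=dict)


def describe(rel):
    """Render a relationship as (source)-[:TYPE]->(target)."""
    return f"({rel.source.id}:{rel.source.type})-[:{rel.type}]->({rel.target.id}:{rel.target.type})"


person = Node(id="alice", type="Person", properties={"name": "Alice", "age": 34})
city = Node(id="paris", type="City", properties={"name": "Paris"})
lives_in = Relationship(source=person, target=city, type="LOCATED")

print(describe(lives_in))  # (alice:Person)-[:LOCATED]->(paris:City)
print(person.properties["age"])  # 34
```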

### **Loading and chunking Wikipedia pages**

So how do we go from the unstructured text data we've seen to a nice structured graph? There are a few ways to do this, but we'll be using LLMs. Let's load the Wikipedia results from searching `"large language model"` using the `WikipediaLoader` class, and split the first three documents into chunks. Each document has page content and metadata, as seen below.

In [4]:
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import TokenTextSplitter

raw_documents = WikipediaLoader(query="large language model").load()
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=20)
documents = text_splitter.split_documents(raw_documents[:3])

print(documents[0])

page_content='A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.
The largest and most capable LLMs are generative pretrained transformers (GPTs). Modern models can be fine-tuned for specific tasks or guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ont' metadata={'title': 'Large language model', 'summary': 'A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.\nThe largest and most capable LLMs are generative pretrained transformers (GPTs). Modern models can be fine-tuned for specific tasks or guided by prompt engineering. Thes
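To see what `chunk_size=100` and `chunk_overlap=20` mean in practice, here is a rough sketch of sliding-window chunking over a list of tokens. It is not the `TokenTextSplitter` implementation: the function name `chunk_tokens` is hypothetical, and the placeholder strings stand in for the tokens a real tokenizer would produce.

```python
def chunk_tokens(tokens, chunk_size=100, chunk_overlap=20):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Placeholder tokens; a real splitter would obtain these from a tokenizer.
tokens = [f"t{i}" for i in range(250)]
chunks = chunk_tokens(tokens)

print([len(chunk) for chunk in chunks])  # [100, 100, 90]
print(chunks[0][-20:] == chunks[1][:20])  # True: neighbouring chunks share 20 tokens
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of processing some tokens twice.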

### **From text to graph**

We begin by defining the LLM to use for the transformation, and use it to create an `LLMGraphTransformer`. Note that we use `temperature=0` to produce more deterministic graphs for greater reliability. The LLM creates structured graph documents by parsing and categorizing entities and their relationships, which it infers from the documents. The transformation is performed using the `.convert_to_graph_documents()` method on the documents.

In [5]:
from langchain_openai import ChatOpenAI
from langchain_experimental.graph_transformers import LLMGraphTransformer
import os

api_key = os.environ.get("OPENAI_API_KEY")

llm = ChatOpenAI(api_key=api_key, temperature=0, model="gpt-4o-mini")
llm_transformer = LLMGraphTransformer(llm=llm)

graph_documents = llm_transformer.convert_to_graph_documents(documents)
print(graph_documents)

[GraphDocument(nodes=[Node(id='Large Language Model', type='Machine learning model', properties={}), Node(id='Natural Language Processing', type='Task', properties={}), Node(id='Language Generation', type='Task', properties={}), Node(id='Parameters', type='Concept', properties={}), Node(id='Self-Supervised Learning', type='Learning method', properties={}), Node(id='Text', type='Data', properties={}), Node(id='Generative Pretrained Transformers', type='Model', properties={}), Node(id='Gpts', type='Model', properties={}), Node(id='Specific Tasks', type='Task', properties={}), Node(id='Prompt Engineering', type='Technique', properties={}), Node(id='Predictive Power', type='Concept', properties={}), Node(id='Syntax', type='Concept', properties={}), Node(id='Semantics', type='Concept', properties={}), Node(id='Ontology', type='Concept', properties={})], relationships=[Relationship(source=Node(id='Large Language Model', type='Machine learning model', properties={}), target=Node(id='Natural L

In the output, we can see how the model inferred many entities from the text and created nodes with ids and types to match. Relationships between the entities were also inferred and mapped using edges going from a source node to a target node.

In [3]:
from langchain_community.document_loaders import TextLoader

# Load the document directly from the file
loader = TextLoader("./datasets/famous_scientists.txt")
docs = loader.load()

# Define the LLM
llm = ChatOpenAI(api_key=api_key, model="gpt-4o-mini", temperature=0)

# Instantiate the LLM graph transformer
llm_transformer = LLMGraphTransformer(llm=llm)

# Convert the text documents to graph documents
graph_documents = llm_transformer.convert_to_graph_documents(docs)
print(f"Derived Nodes:\n{graph_documents[0].nodes}\n")
print(f"Derived Edges:\n{graph_documents[0].relationships}")

Derived Nodes:
[Node(id='Albert Einstein', type='Person', properties={}), Node(id='Marie Curie', type='Person', properties={}), Node(id='Nobel Prize In Physics', type='Award', properties={}), Node(id='Nobel Prize In Chemistry', type='Award', properties={}), Node(id='Theory Of Relativity', type='Concept', properties={}), Node(id='Photoelectric Effect', type='Concept', properties={}), Node(id='Radioactivity', type='Concept', properties={}), Node(id='Radiation', type='Concept', properties={}), Node(id='Radium', type='Element', properties={}), Node(id='Polonium', type='Element', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='Henri Becquerel', type='Person', properties={})]

Derived Edges:
[Relationship(source=Node(id='Albert Einstein', type='Person', properties={}), target=Node(id='Theory Of Relativity', type='Concept', properties={}), type='KNOWN_FOR', properties={}), Relationship(source=Node(id='Albert Einstein', type='Person', properties={}), target=Node
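The raw printout of relationships is hard to scan. The sketch below flattens relationships into `(source, type, target)` triples and prints them in an arrow notation. The helper name `to_triples` is hypothetical, and it assumes each relationship exposes `source`, `target`, and `type` attributes as the printed output suggests. The stand-in objects reproduce only the one relationship visible in the truncated output above.

```python
from types import SimpleNamespace


def to_triples(relationships):
    """Return (source id, relationship type, target id) tuples."""
    return [(rel.source.id, rel.type, rel.target.id) for rel in relationships]


# Stand-in objects with the same attribute names as the printed output.
einstein = SimpleNamespace(id="Albert Einstein", type="Person")
relativity = SimpleNamespace(id="Theory Of Relativity", type="Concept")
relationships = [SimpleNamespace(source=einstein, target=relativity, type="KNOWN_FOR")]

for source, rel_type, target in to_triples(relationships):
    print(f"({source})-[:{rel_type}]->({target})")
# (Albert Einstein)-[:KNOWN_FOR]->(Theory Of Relativity)
```

With real data, the same helper could be called as `to_triples(graph_documents[0].relationships)`, provided the attribute names match this assumption.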

## **2. Storing and querying documents**

We'll be using `Neo4j` to store and query our graph documents. Neo4j has both cloud-based and local versions to suit your use case.

__Note 1__: _First, run `pip install neo4j`. To run Neo4j locally, download Neo4j Server from [here](https://neo4j.com/download/), then download the APOC release that is compatible with your Neo4j Server version from [here](https://github.com/neo4j/apoc/releases). It is very important to follow the installation instructions provided by Neo4j._

__Note 2__: `from langchain_community.graphs import Neo4jGraph` no longer works. Instead, you need to install `langchain-neo4j` and import `Neo4jGraph` from there. Then you are good to go.

For our purposes, we'll assume that the database is already available locally.

We instantiate a graph with the `Neo4jGraph` class, specifying the URL of the Neo4j database server and the credentials needed to access it. In a production setting, these credentials should be stored in environment variables rather than committed to a codebase.

In [3]:
from langchain_neo4j import Neo4jGraph
import os

# Get the environment variables for Neo4j connection
url = os.environ["NEO4J_URI"]
user = os.environ["NEO4J_USERNAME"]
password = os.environ["NEO4J_PASSWORD"]

# Create a Neo4jGraph instance
graph = Neo4jGraph(url=url, username=user, password=password)
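Reading `os.environ[...]` directly raises a bare `KeyError` when a variable is missing. A small helper can report all missing variables at once, as sketched below. The function name `load_neo4j_settings` is hypothetical, and the values in `example_env` are placeholders rather than real credentials.

```python
import os

REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def load_neo4j_settings(env=None):
    """Collect connection settings, failing early with a readable message."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "url": env["NEO4J_URI"],
        "username": env["NEO4J_USERNAME"],
        "password": env["NEO4J_PASSWORD"],
    }


# Demo with placeholder values; in practice, call load_neo4j_settings() with no arguments.
example_env = {
    "NEO4J_URI": "bolt://localhost:7687",
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "change-me",
}
settings = load_neo4j_settings(example_env)
print(settings["url"])  # bolt://localhost:7687
```

The returned keys match the keyword arguments used in the cell above, so the result could be passed along as `Neo4jGraph(**load_neo4j_settings())`.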

### **Storing graph documents**

Carrying on from the graph documents we created from Wikipedia results on large language models, we can add these graph documents to our database using the `.add_graph_documents()` method. 

- `include_source=True` links each node to its source document with a `MENTIONS` edge, enabling better traceability and context preservation.
- `baseEntityLabel=True` adds an `__Entity__` label to each node, improving query performance.

In [7]:
from langchain_experimental.graph_transformers import LLMGraphTransformer

llm = ChatOpenAI(api_key=api_key, temperature=0, model="gpt-4o-mini")
llm_transformer = LLMGraphTransformer(llm=llm)

graph_documents = llm_transformer.convert_to_graph_documents(documents)

graph.add_graph_documents(
    graph_documents,
    include_source=True,
    baseEntityLabel=True
)

Here is what our graph looks like:

<img src='./images/graph.jpeg' width=75%>

You can explore the nodes and edges in the Neo4j browser: enter `MATCH (n) RETURN n LIMIT 40;` in the query box to display up to 40 nodes.

Red nodes are the source documents we specified when adding the graph documents; each one has a MENTIONS relationship from the source to the entity mentioned.

We can also view the database schema with the `.get_schema` attribute. Here, we can see the different node types and relationships, including their direction:

In [8]:
print(graph.get_schema)

Node properties:
Person {name: STRING, age: INTEGER}
Relationship properties:

The relationships:



In [11]:
# Check if nodes were created
result = graph.query("MATCH (n) RETURN count(n)")
print(result)

[{'count(n)': 180}]


### **Querying Neo4j - Cypher Query Language**

Neo4j introduced Cypher Query Language in 2011 as a declarative query language for intuitively navigating and manipulating graph data using a SQL-like syntax.

<img src='./images/cypher.png' width=50%>
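Cypher patterns are written to look like the graph they match: nodes go in parentheses, relationships in square brackets, and the arrow shows direction. The sketch below assembles such a pattern from its parts purely to show the syntax. The helper names and the `City` label are hypothetical, and this kind of string formatting is for illustration only, not a way to build queries from untrusted input.

```python
def node(variable, label):
    """A node pattern: (variable:Label)."""
    return f"({variable}:{label})"


def directed(source, relationship_type, target):
    """A directed relationship pattern: source-[:TYPE]->target."""
    return f"{source}-[:{relationship_type}]->{target}"


pattern = directed(node("p", "Person"), "LOCATED", node("c", "City"))
query = f"MATCH {pattern} RETURN p, c LIMIT 5"

print(query)
# MATCH (p:Person)-[:LOCATED]->(c:City) RETURN p, c LIMIT 5
```

Read aloud, the query says: find every `Person` node with a `LOCATED` edge pointing to a `City` node, and return up to five such pairs.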

### **Querying the LLM graph**

Let's query our database of Wikipedia results about LLMs.

We'll query the database to find out who developed the GPT-4 model. Our query looks for a match between a model and an organization joined by the `DEVELOPED_BY` relationship. 

In [21]:
results = graph.query(
    """MATCH (gpt4:Model {id: "Gpt-4"})-[:DEVELOPED_BY]->(org:Organization)
    RETURN org
    """
)

print(results)



[]
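An empty result does not necessarily mean the information is absent. Because the LLM chooses node types, relationship types, and ids itself, a query only matches if its labels, relationship types, and `id` values are spelled exactly as they were stored. One way to check is to list what actually exists, as sketched below using Neo4j's built-in `db.labels()` and `db.relationshipTypes()` procedures. The helper names are hypothetical, and `FakeGraph` is a stand-in with canned rows so the snippet runs without a database. The sketch assumes the real graph object has a `.query()` method that returns a list of dictionaries, as in the cells above.

```python
def list_labels(graph):
    """Node labels that exist in the database."""
    rows = graph.query("CALL db.labels() YIELD label RETURN label ORDER BY label")
    return [row["label"] for row in rows]


def list_relationship_types(graph):
    """Relationship types that exist in the database."""
    rows = graph.query(
        "CALL db.relationshipTypes() YIELD relationshipType "
        "RETURN relationshipType ORDER BY relationshipType"
    )
    return [row["relationshipType"] for row in rows]


def check_names(graph, labels, relationship_types):
    """Report which names used in a query are not present in the database."""
    known_labels = set(list_labels(graph))
    known_types = set(list_relationship_types(graph))
    return {
        "missing_labels": [n for n in labels if n not in known_labels],
        "missing_relationship_types": [n for n in relationship_types if n not in known_types],
    }


class FakeGraph:
    """Hypothetical stand-in returning canned rows instead of querying Neo4j."""

    def query(self, query, params=None):
        if "db.labels" in query:
            return [{"label": "Model"}, {"label": "Organization"}]
        return [{"relationshipType": "DEVELOPED_BY"}]


report = check_names(FakeGraph(), ["Model", "Organisation"], ["DEVELOPED_BY"])
print(report)
# {'missing_labels': ['Organisation'], 'missing_relationship_types': []}
```

Passing the real `graph` object instead of `FakeGraph()` would run the same checks against the stored data.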


## **3. Creating the Graph RAG chain**

