# Prerequisites

# 1. Conda Environment


    Create a Conda environment with Python 3.10-3.12 using the commands below.

        conda create -n graph python=3.11 -y && conda activate graph

        pip install graphrag

    We will use the "graph" Conda environment to run the notebook below and create the graph indexes.
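
As a quick sanity check after activating the environment, the interpreter version can be compared against the 3.10-3.12 range stated above. This is a minimal sketch; the helper name `is_supported` is hypothetical and not part of GraphRAG.

```python
import sys

# Supported range quoted in this notebook: Python 3.10 through 3.12.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported(version: tuple[int, int]) -> bool:
    """Return True if (major, minor) falls inside the supported range."""
    return MIN_VERSION <= version <= MAX_VERSION


current = (sys.version_info.major, sys.version_info.minor)
print(f"Python {current[0]}.{current[1]} supported: {is_supported(current)}")
```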

# 2. Input Data

    The text files need to be placed inside an **input** folder so that GraphRAG can read them.

        - The file location for this notebook is a local drive

    You can also use a Microsoft Fabric OneLake location by following the steps below:

        - Download and install the OneLake file explorer from https://www.microsoft.com/en-us/download/details.aspx?id=105222

        - Once it is installed, you will be able to browse to your Fabric workspace and OneLake

        - Copy the text file that you want to be graphed into the Files folder in OneLake
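
Before indexing, it can help to confirm that the **input** folder actually contains text files. The sketch below assumes the project root is the current directory and that the inputs use a `.txt` extension; `find_input_files` is a hypothetical helper, not a GraphRAG function.

```python
from pathlib import Path


def find_input_files(root: str = ".") -> list[Path]:
    """Return the .txt files directly inside <root>/input, sorted by name."""
    input_dir = Path(root) / "input"
    if not input_dir.is_dir():
        return []
    return sorted(p for p in input_dir.glob("*.txt") if p.is_file())


files = find_input_files(".")
if files:
    print(f"Found {len(files)} text file(s) in ./input")
else:
    print("No .txt files found in ./input; add one before indexing.")
```

The same check works for a OneLake location: pass the synced folder path as `root`.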

# GraphRag Initialization

Use the init command to initialize GraphRAG. In our case the text files are inside the **/input** folder.

The init command will create the .env and settings.yaml files and a prompts folder with the default prompts used to extract the graph entities.

In [None]:
# assuming that the current directory is the root of the project and contains the input folder
!python -m graphrag.index --init --root ./
#!python -m graphrag.index --init --root "C:\Users\safdarzaman\OneLake - Microsoft\My workspace\yellowtaxidatalakehouse.Lakehouse\Files\Ragdata"

Prior to running the indexing job in the next cell, please update the llm, embeddings and snapshots sections in settings.yaml:

        llm:
            api_key: <api key for Azure Open AI model>
            type: azure_openai_chat
            model: gpt-4o
            model_supports_json: true # recommended if this is available for your model.
            max_tokens: 4096
            request_timeout: 180.0
            api_base: https://xxxx.openai.azure.com
            api_version: 2024-02-15-preview
            organization: Microsoft
            deployment_name: <Name of your deployment>

        embeddings:
            ## parallelization: override the global parallelization settings for embeddings
            async_mode: threaded # or asyncio
            # target: required # or all
            llm:
                api_key: <api key for Azure Open AI model>
                type: azure_openai_embedding
                model: text-embedding-3-large
                api_base: https://xxxx.openai.azure.com
                api_version: 2024-02-15-preview
                organization: Microsoft
                deployment_name: <Name of your deployment>

        snapshots:
            graphml: true
            raw_entities: true
            top_level_nodes: true

        
Update the .env file and provide the Azure OpenAI API key in it.

Note: setting the graphml parameter to true writes the GraphML files that are used to visualize the graph.
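
To catch a missing key before the indexing job starts, the .env file can be parsed and checked. This is a minimal sketch that assumes the key is stored under the variable name `GRAPHRAG_API_KEY` and that an unedited file holds a `<...>` placeholder; both are assumptions, so adjust them to match your generated .env. The helper names are hypothetical.

```python
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_real_api_key(values: dict[str, str], name: str = "GRAPHRAG_API_KEY") -> bool:
    """True if the key is present and is not empty or a <placeholder>."""
    value = values.get(name, "")
    return bool(value) and not (value.startswith("<") and value.endswith(">"))


env = read_env_file("./.env")
print("API key configured:", has_real_api_key(env))
```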


# Graph Indexing

The graphrag.index pipeline will create the parquet files under the output folder with entities, relationships and community reports.

In [None]:
# assuming that the current directory is the root of the project and contains the input folder
!python -m graphrag.index --root ./
# uncomment the line below for blob storage or Microsoft OneLake
#!python -m graphrag.index --root "C:\Users\safdarzaman\OneLake - Microsoft\My workspace\yellowtaxidatalakehouse.Lakehouse\Files\Ragdata"

After the index pipeline runs successfully, you will have multiple parquet and graphml files under the artifacts folder.
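
The later cells hard-code a run folder such as `./output/20241002-102507/artifacts`. As a sketch of an alternative, the most recent run can be located automatically, assuming each run is written to `output/<timestamp>/artifacts` as in the paths used in this notebook. `latest_artifacts_dir` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def latest_artifacts_dir(root: str = ".") -> Optional[Path]:
    """Return the artifacts folder of the newest run, or None if there is none.

    Run folders are named YYYYMMDD-HHMMSS, so sorting by name is also
    sorting by time.
    """
    output_dir = Path(root) / "output"
    if not output_dir.is_dir():
        return None
    runs = sorted(p for p in output_dir.iterdir() if (p / "artifacts").is_dir())
    return runs[-1] / "artifacts" if runs else None


artifacts = latest_artifacts_dir(".")
print("Latest artifacts folder:", artifacts)
```

The returned path could then be used in place of the literal `INPUT_DIR` value in the query cells.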

# Visualize the Graph

In [None]:
import glob
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

# Load the GraphML file
graph = nx.read_graphml(glob.glob("./output/20241002-102507/artifacts/merged_graph.graphml")[0])

fig=plt.figure(figsize=(20, 20))
# Extract node types
node_types = nx.get_node_attributes(graph, 'type')

# Define a color map for the node types
unique_types = list(set(node_types.values()))
color_map = {node_type: plt.cm.tab20(i / len(unique_types)) for i, node_type in enumerate(unique_types)}

# Assign colors to nodes based on their type
node_colors = [color_map[node_types[node]] if node in node_types else "r" for node in graph.nodes]

# Calculate PageRank
pagerank = nx.pagerank(graph)

# Assign node sizes based on PageRank
#node_sizes = [pagerank[node] * 10000 for node in graph.nodes]

# Calculate node degree
node_degree = dict(graph.degree())

# Assign node sizes based on node degree
node_sizes = [node_degree[node]*60 for node in graph.nodes]

# Assign font sizes based on node degree
font_sizes = {node:np.sqrt(node_degree[node]) * 0.15 for node in graph.nodes}

pos = nx.spring_layout(graph, k=0.20, iterations=20)
#nx.draw(graph, pos, with_labels=True, font_size=font_sizes, font_color='black', font_weight='bold', node_color=node_colors, node_size=node_sizes, edge_color='gray', linewidths=0.5, alpha=0.9)
nx.draw(graph, pos, with_labels=False, node_color=node_colors, node_size=node_sizes, edge_color='gray', linewidths=0.5, alpha=0.9)
for node, (x, y) in pos.items():
    plt.text(x, y, node, fontsize=font_sizes[node]*5, ha='center', va='center')
fig.set_facecolor("black")
plt.show()

# Run queries on the created Graph entities 

In [None]:
%pip install graphrag
%pip install azure-core
%pip install langchain
%pip install tiktoken
%pip install langchain-openai

In [175]:
import os

import pandas as pd
import tiktoken
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import store_entity_semantic_embeddings
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.structured_search.global_search.community_context import (
    GlobalCommunityContext,
)
from graphrag.query.structured_search.global_search.search import GlobalSearch
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.lancedb import LanceDBVectorStore

Provide the LLM configuration that GraphRAG will use at query time.

In [176]:
llm = ChatOpenAI(
    api_key="XXXXXXXXX",
    model="gpt4o",
    api_type=OpenaiApiType.AzureOpenAI,
    max_retries=20,
    api_base="https://XXXXXXXr.openai.azure.com",
    api_version="2024-02-15-preview",
)

text_embedder = OpenAIEmbedding(
    api_key="XXXXXX",
    api_base="https://XXXXXXX.openai.azure.com",
    api_type=OpenaiApiType.AzureOpenAI,
    model="text-embedding-3-large",
    deployment_name="text-embedding-3-large",
    max_retries=20,
    api_version="2024-02-15-preview",
)

token_encoder = tiktoken.get_encoding("cl100k_base")

Provide the path of the artifacts folder that contains the Parquet files, and load the files into pandas DataFrames.

In [177]:
#INPUT_DIR = "C:/Users/safdarzaman/OneLake - Microsoft/My workspace/yellowtaxidatalakehouse.Lakehouse/Files/Ragdata/output/20240925-113329/artifacts"

INPUT_DIR = "./output/20241002-102507/artifacts"
LANCEDB_URI = f"{INPUT_DIR}/lancedb"




# parquet files generated from indexing pipeline
COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"
RELATIONSHIP_TABLE = "create_final_relationships"
COVARIATE_TABLE = "create_final_covariates"
TEXT_UNIT_TABLE = "create_final_text_units"
COMMUNITY_TABLE = "create_final_communities"

# community level in the Leiden community hierarchy from which we will load the community reports
# a higher value means we use reports from more fine-grained communities (at a higher computation cost)
COMMUNITY_LEVEL = 1



# read nodes table to get community and degree data

entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")
community_df= pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_TABLE}.parquet")


reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)
entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)
relationships = read_indexer_relationships(relationship_df)
text_units = read_indexer_text_units(text_unit_df)

# load description embeddings to an in-memory lancedb vectorstore
# to connect to a remote db, specify url and port values.
description_embedding_store = LanceDBVectorStore(
    collection_name="entity_description_embeddings",
)
description_embedding_store.connect(db_uri=LANCEDB_URI)
entity_description_embeddings = store_entity_semantic_embeddings(
    entities=entities, vectorstore=description_embedding_store
)

Print the count of Graph entities, relationships, text units and reports

In [None]:
print(f"Entity count: {len(entity_df)}")
print(f"Relationship count: {len(relationship_df)}")

print(f"Text unit records: {len(text_unit_df)}")

print(f"Total report count: {len(report_df)}")

print(
    f"Report count after filtering by community level {COMMUNITY_LEVEL}: {len(reports)}"
)
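
The last count is lower than the total because reports are filtered by community level. The sketch below illustrates the idea on toy data, assuming the filter keeps reports whose `level` is at or below `COMMUNITY_LEVEL`. It is a simplification for intuition only: `read_indexer_reports` may apply further filtering, and `filter_by_level` is a hypothetical helper.

```python
def filter_by_level(reports: list[dict], max_level: int) -> list[dict]:
    """Keep only reports at or above the requested granularity cutoff."""
    return [r for r in reports if r["level"] <= max_level]


# Toy reports: level 0 is the coarsest grouping, higher levels are finer.
sample_reports = [
    {"community": "0", "level": 0},
    {"community": "1", "level": 0},
    {"community": "2", "level": 1},
    {"community": "3", "level": 2},
]

kept = filter_by_level(sample_reports, max_level=1)
print(f"Total: {len(sample_reports)}, kept at level <= 1: {len(kept)}")
```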


# Inspect the DataFrames loaded from the Parquet files

In [None]:
#embedding dataframe
entity_embedding_df.head()


In [None]:
#relationship dataframe
relationship_df.head(5)

In [None]:
#textunit data frame
text_unit_df.head()

In [None]:
#report dataframe
report_df.head()

# Run Global Search on data 

In [157]:
# Global search is used to answer questions that need an understanding of the complete dataset, such as "What are the top 10 themes in the articles?"
context_builder = GlobalCommunityContext(
    community_reports=reports,
    entities=entities,  # default to None if you don't want to use community weights for ranking
    token_encoder=token_encoder,
)


context_builder_params = {
    "use_community_summary": False,  # False means using full community reports. True means using community short summaries.
    "shuffle_data": True,
    "include_community_rank": True,
    "min_community_rank": 0,
    "community_rank_name": "rank",
    "include_community_weight": True,
    "community_weight_name": "occurrence weight",
    "normalize_community_weight": True,
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
    "context_name": "Reports",
}

map_llm_params = {
    "max_tokens": 1000,
    "temperature": 0.0,
    "response_format": {"type": "json_object"},
}

reduce_llm_params = {
    "max_tokens": 2000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}


global_search_engine = GlobalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    max_data_tokens=12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
    map_llm_params=map_llm_params,
    reduce_llm_params=reduce_llm_params,
    allow_general_knowledge=False,  # set this to True will add instruction to encourage the LLM to incorporate general knowledge in the response, which may increase hallucinations, but could be useful in some use cases.
    json_mode=True,  # set this to False if your LLM model does not support JSON mode.
    context_builder_params=context_builder_params,
    concurrent_coroutines=32,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

In [None]:
#result = await global_search_engine.asearch("what is common between all three of states including  ALASKA and Washington, D.C and california, only mention the common thing between them")
#result = await global_search_engine.asearch("what is common law in all three of states including  ALASKA and Washington, D.C and california, only mention the common thing between them")
result = await global_search_engine.asearch("what is the top theme in the data")
#CONNECTICUT, VERMONT, NEW JERSEY, and PENNYSYLVANIA

print(result.response)
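
The top-level `await` in the cell above works because a notebook already runs an event loop. In a plain Python script the same call would need to be wrapped in `asyncio.run`. The sketch below shows the pattern with a hypothetical stand-in engine (`FakeSearchEngine`) so it runs without any LLM; in practice you would pass `global_search_engine` instead.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical result object exposing only a .response attribute."""

    response: str


class FakeSearchEngine:
    """Hypothetical stand-in with the same async asearch() call shape."""

    async def asearch(self, query: str) -> FakeResult:
        await asyncio.sleep(0)  # placeholder for the real LLM round trips
        return FakeResult(response=f"(stub answer for: {query})")


async def run_query(engine, query: str) -> str:
    """Await the engine's asearch() and return the response text."""
    result = await engine.asearch(query)
    return result.response


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
answer = asyncio.run(run_query(FakeSearchEngine(), "what is the top theme in the data"))
print(answer)
```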

In [None]:
# inspect the data used to build the context for the LLM responses
result.context_data["reports"]

In [None]:
print(f"LLM calls: {result.llm_calls}. LLM tokens: {result.prompt_tokens}")

# Run Local Search on Data

In [166]:
# Local search is used to answer questions that need a very specific and granular part of our graph, such as "What is the name of the country with X law?"
context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    # if you did not run covariates during indexing, set this to None
    #covariates=covariates,
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

# text_unit_prop: proportion of context window dedicated to related text units
# community_prop: proportion of context window dedicated to community reports.
# The remaining proportion is dedicated to entities and relationships. Sum of text_unit_prop and community_prop should be <= 1
# conversation_history_max_turns: maximum number of turns to include in the conversation history.
# conversation_history_user_turns_only: if True, only include user queries in the conversation history.
# top_k_mapped_entities: number of related entities to retrieve from the entity description embedding store.
# top_k_relationships: control the number of out-of-network relationships to pull into the context window.
# include_entity_rank: if True, include the entity rank in the entity table in the context window. Default entity rank = node degree.
# include_relationship_weight: if True, include the relationship weight in the context window.
# include_community_rank: if True, include the community rank in the context window.
# return_candidate_context: if True, return a set of dataframes containing all candidate entity/relationship/covariate records that
# could be relevant. Note that not all of these records will be included in the context window. The "in_context" column in these
# dataframes indicates whether the record is included in the context window.
# max_tokens: maximum number of tokens to use for the context window.


local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids
    "max_tokens": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)
}

llm_params = {
    "max_tokens": 2_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}


local_search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)
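
The proportions in `local_context_params` divide the context window between text units, community reports, and entities/relationships. The arithmetic can be sketched as follows, using the values from the cell above. `split_context_budget` is a hypothetical helper, and the library's exact rounding may differ.

```python
def split_context_budget(
    max_tokens: int, text_unit_prop: float, community_prop: float
) -> dict[str, int]:
    """Split the context window into the three budgets described above."""
    if text_unit_prop + community_prop > 1:
        raise ValueError("text_unit_prop + community_prop must be <= 1")
    text_unit_tokens = int(max_tokens * text_unit_prop)
    community_tokens = int(max_tokens * community_prop)
    # Whatever is left goes to entities and relationships.
    local_tokens = max_tokens - text_unit_tokens - community_tokens
    return {
        "text_units": text_unit_tokens,
        "community_reports": community_tokens,
        "entities_and_relationships": local_tokens,
    }


budget = split_context_budget(max_tokens=12_000, text_unit_prop=0.5, community_prop=0.1)
print(budget)
# {'text_units': 6000, 'community_reports': 1200, 'entities_and_relationships': 4800}
```

With these settings, half of the window goes to source text, which favors answers grounded in the original passages over community-level summaries.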

In [None]:
#result = await local_search_engine.asearch("what is common between all three of states including  ALASKA and Washington, D.C and california, only mention the common thing between them")
result = await local_search_engine.asearch("what is most important thing in NEW JERSEY transportation networks")

print(result.response)

In [None]:
result.context_data["entities"].head()

In [None]:
result.context_data["relationships"].head()

In [None]:
result.context_data["reports"].head()

In [None]:
result.context_data["sources"].head()
