# Setup Environment

In [1]:
!pip install langchain



In [2]:
!pip install sentence-transformers==2.2.2



In [9]:
!pip install InstructorEmbedding==1.0.1

Collecting InstructorEmbedding==1.0.1
  Downloading InstructorEmbedding-1.0.1-py2.py3-none-any.whl (19 kB)
Installing collected packages: InstructorEmbedding
Successfully installed InstructorEmbedding-1.0.1


In [3]:
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

key_vault_name = "kv-bsauwmno"
kv_uri = f"https://{key_vault_name}.vault.azure.net/"

credential = DefaultAzureCredential()
client = SecretClient(vault_url=kv_uri, credential=credential)

# Fetch the Neo4j connection details (URL, user, password) from Key Vault
neo4j_url = client.get_secret("NEO4JURL").value
neo4j_user = client.get_secret("NEO4JUSER").value
neo4j_password = client.get_secret("NEO4JPASSWORD").value

# TODO: add OpenAI token to Keyvault

In [4]:
import langchain
langchain.verbose = True
langchain.debug = True

In [5]:
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings

embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-xl", 
    cache_folder='./models/model_cache_xl'
)

load INSTRUCTOR_Transformer
max_seq_length  512


When the vector store is backed by a graph database, there are a few things to take into account:

- `Neo4jVector` implements LangChain's `VectorStore` interface, not the `GraphStore` interface, so it exposes similarity search and `as_retriever()` rather than methods for adding graph documents or refreshing the schema.
- `as_retriever()` does not copy or re-embed anything; it wraps the existing store in a vector store retriever that queries the Neo4j indexes directly.
- The retriever then performs the retrieval based on the provided configuration, such as the search type, search arguments, and any additional filters.

It's important to ensure that the vector index (and, for hybrid search, the full-text index) already exists in the database and was built with the same embedding model that is used at query time.

In [8]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Neo4jVector
from langchain.document_loaders import TextLoader
from langchain.docstore.document import Document

# Now we initialize from existing graph -> doesn't
neo4j_graph = Neo4jVector.from_existing_index(
    embedding=embeddings,
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_password,
    index_name="vi_chunk_embedding_cosine",
    keyword_index_name="fts_Chunk_text",
    search_type="hybrid",
)


# Documentation
- Retrieve more documents with higher diversity: `as_retriever(search_type="mmr", search_kwargs={'k': 6, 'lambda_mult': 0.25})`
- Fetch more documents for the MMR algorithm to consider, but only return the top 5: `as_retriever(search_type="mmr", search_kwargs={'k': 5, 'fetch_k': 50})`
- Only retrieve documents that have a relevance score above a certain threshold: `as_retriever(search_type="similarity_score_threshold", search_kwargs={'score_threshold': 0.8})`
- Only get the single most similar document from the dataset: `as_retriever(search_kwargs={'k': 1})`
- Use a filter to only retrieve documents from a specific paper: `as_retriever(search_kwargs={'filter': {'paper_title':'GPT-4 Technical Report'}})`
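
To make the `lambda_mult` and `fetch_k` options above more concrete, here is a minimal pure-Python sketch of maximal marginal relevance (MMR). It is an illustration only, not LangChain's actual implementation; the function names `cosine` and `mmr_select` and the toy vectors are assumptions made for this example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 4,
    fetch_k: int = 20,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return the indices of up to k documents chosen by MMR."""
    sims = [cosine(query_vec, d) for d in doc_vecs]
    # Step 1: keep only the fetch_k candidates most similar to the query.
    candidates = sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)
    candidates = candidates[:fetch_k]

    selected: List[int] = []
    # Step 2: greedily pick the candidate with the best trade-off between
    # relevance to the query and redundancy with what is already selected.
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * sims[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, document 2 is different.
toy_docs = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]]
toy_query = [1.0, 0.3]
print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=1.0))
print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.25))
```

With `lambda_mult=1.0` the selection is driven purely by similarity to the query, so the two near-duplicates are returned; lowering it penalises redundancy, so the second pick shifts to the dissimilar document. `fetch_k` caps how many candidates the greedy step may choose from.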

In [9]:
from typing import List

def print_docs(l: List[langchain.schema.document.Document]) -> None:
    """Print the metadata and content of each document."""
    for i, d in enumerate(l):
        print(25 * "=", f" Document {i+1} ", 25 * "=")
        print("1. Metadata", 52 * "=")
        for k, v in d.metadata.items():
            print(f"- {k}: {v}")

        print("2. Content", 53 * "=")
        print(d.page_content)

In [10]:
from typing import List

# Define the configuration
search_config = {
    # "similarity" (default), "mmr", or "similarity_score_threshold".
    'search_type': 'similarity',
    'search_kwargs': {
        # Number of documents to return (default: 4).
        'k': 10,
        # Number of documents to pass to the MMR algorithm (default: 20).
        'fetch_k': 50,
        # Minimum relevance threshold for similarity_score_threshold.
        'score_threshold': 0,
        # Diversity of results returned by MMR:
        # 1 for minimum diversity and 0 for maximum (default: 0.5).
        'lambda_mult': 0.25,
        # Filter by document metadata.
        'filter': {'chunk_size': 500}
    }
}

neo4j_retriever = neo4j_graph.as_retriever(**search_config)

# The query is in Dutch: "What should I do for my kidney operation?"
docs: List[langchain.schema.document.Document] = (
    neo4j_retriever.get_relevant_documents(
        "Wat moet ik doen voor mijn nieroperatie?"
    )
)

In [9]:
print_docs(l=docs)

- chunk_size: 500
- embedding_model: hkunlp/instructor-xl
- chunk_order: 2431
- chunk_overlap: 60
- chunk_id: 2431
Er zijn verschillende types nierstenen (zes om precies te zijn), met elk hun eigen oorzaak en ontstaansmechanisme. Als je het type steen en de risicofactoren kent, kun je behandeling daarop afstemmen en is advies op maat mogelijk. 


Heel concreet geeft het metabool bilan een antwoord op de volgende vragen:


* Waarom maak ik nierstenen aan?
* Welke maatregelen zijn nodig om mijn kans op herval te verminderen.



## Voor wie?
- chunk_size: 500
- embedding_model: hkunlp/instructor-xl
- chunk_order: 1462
- chunk_overlap: 60
- chunk_id: 1462
## Hoe bereid ik me het best voor?


Je hoeft niet nuchter te zijn. Voor de start van het onderzoek bespreken we met jou de taken die je tijdens het onderzoek moet uitvoeren. Dit kunnen motorische taken zijn (waarbij je een bepaald lichaamsdeel moet bewegen) of cognitieve taken (waarbij je alleen moet denken) of beiden. Zorg ervoor dat je

# Exploring Contextualised Retrievals

The graph contains a lot of metadata that can be leveraged to make the retrieved documents more relevant for a given chain.

In this example, we send a Cypher query through the LangChain vector store (which forwards it to Neo4j) and pattern match on specific characteristics of the retrieved results.

In [25]:
neo4j_graph

<langchain.vectorstores.neo4j_vector.Neo4jVector at 0x7fdba3243ca0>

In [53]:
print(f"""
            CALL db.index.vector.queryNodes(
            "fts_Chunk_text", 
            '{embeddings.embed_query(KW_SEARCH_QUERY)}'
        ) 
        YIELD node, score
        WITH node, score
        ORDER BY score DESCENDING
        LIMIT {RESULT_LIM}

        MATCH rel=(node:Chunk)<-[r*]-(:Catalog)
        RETURN rel
""")


            CALL db.index.vector.queryNodes(
            "fts_Chunk_text", 
            '[0.018609369173645973, 0.02538321539759636, 0.004355217330157757, -0.07988579571247101, -0.053384460508823395, -0.0030687369871884584, -0.06292525678873062, -0.018009264022111893, -0.013918952085077763, -0.005331815220415592, -0.013592589646577835, 0.01572948507964611, -0.07878094911575317, -0.10487205535173416, -0.043990883976221085, -0.020900895819067955, 0.008492556400597095, -0.027654755860567093, 0.001870549633167684, -0.04285578057169914, -0.02984999492764473, 0.040656525641679764, 0.00395499961450696, 0.05010785907506943, -0.04869605973362923, -0.045616090297698975, -0.027370914816856384, 0.01108709815889597, -0.058365583419799805, -0.0061892373487353325, 0.10293402522802353, 0.01899600401520729, 0.018185339868068695, -0.015510343946516514, 0.02616843394935131, 0.09165032207965851, -0.03410722687840462, 0.007897515781223774, 0.040500085800886154, -0.05239090695977211, -0.045771654695272446,

In [12]:
from langchain.docstore.document import Document
from typing import List
 
RESULT_LIM = 5
KW_SEARCH_QUERY = "dokter"
VECTOR_SEARCH_QUERY = "Ik heb pijn aan mijn nieren, wat moet ik doen?"

# Keyword query 
kw_chunk_paths = neo4j_graph.query(
    f"""
        CALL db.index.fulltext.queryNodes("fts_Chunk_text", '{KW_SEARCH_QUERY}') 
        YIELD node, score
        WITH node, score
        ORDER BY score DESCENDING
        LIMIT {RESULT_LIM}

        MATCH rel=(node:Chunk)<-[r*]-(:Catalog)
        RETURN rel
    """
)

# Vector query 
vec_chunk_paths = neo4j_graph.query(
    f"""
        CALL db.index.vector.queryNodes(
            "vi_chunk_embedding_cosine", 
            {RESULT_LIM},
            {embeddings.embed_query(VECTOR_SEARCH_QUERY)}
        ) 
        YIELD node, score
        WITH node, score
        ORDER BY score DESCENDING
        MATCH rel=(node:Chunk)<-[r*]-(:Catalog)
        RETURN rel
    """
)

def chunk_paths_to_docs(chunk_paths: List[dict]) -> List[Document]:
    """
    Convert path query results into LangChain documents.

    For now, this matches chunk_paths of type (Chunk)-rel-(WebPage)-()
    """
    # Each record holds a path under the 'rel' key, which we turn into a LangChain document.
    path_docs: List[Document] = []

    # One result for every chunk (see above)
    for p in chunk_paths:
        chunk_path_str = '' 
        chunk_node = p['rel'][0] 
        chunk_text = chunk_node.get('text')
        
        # Build up metadata of Document object manually
        doc_meta = {
            'chunk_size': chunk_node.get('chunk_size'),
            'embedding_model': chunk_node.get('embedding_model'),
            'chunk_order': chunk_node.get('chunk_order'),
            'chunk_overlap': chunk_node.get('chunk_overlap'),
            'chunk_id': chunk_node.get('chunk_id'),
        }
        # Traverse path for metadata
        for i, o in enumerate(p['rel']):
            # Create path representation: dicts are nodes, strings are relationship types
            if isinstance(o, dict):
                chunk_path_str += '(Node)'
            elif isinstance(o, str):
                chunk_path_str += f'<-{o}-'

            # The fixed positions below only hold for the shortest path
            # (Chunk)<-(WebPage)<-(Catalog); longer paths shift the nodes.
            # WebPage node data
            if i == 2:
                doc_meta['webpage_scrape_dt'] = o.get('scrape_dt')
                doc_meta['webpage_url'] = o.get('url')
                doc_meta['webpage_title'] = o.get('title')
            # Catalog node data
            elif i == 4:
                doc_meta['catalog_url'] = o.get('url')
        
        # Add path structure as metadata
        doc_meta['path_context'] = chunk_path_str   
        
        # Extract metadata from traversal 
        path_docs.append(
            Document(
                page_content=chunk_text, 
                metadata=doc_meta
            )
        )
    return path_docs

# print_docs(
#     chunk_paths_to_docs(kw_chunk_paths)
# )

print_docs(
    chunk_paths_to_docs(vec_chunk_paths)
)

- chunk_size: 500
- embedding_model: hkunlp/instructor-xl
- chunk_order: 3793
- chunk_overlap: 60
- chunk_id: 3793
- webpage_scrape_dt: 12/11/2023 15:30:44
- webpage_url: https://www.azstlucas.be/onderzoek-en-behandelingen/steenverbrijzeling-eswl
- webpage_title: Steenverbrijzeling (ESWL)
- catalog_url: https://www.azstlucas.be/onderzoek-en-behandelingen
- path_context: (Node)<-HAS_CHUNK-(Node)<-HAS_WEBPAGE-(Node)
* Als je algemeen onwel voelt met een toenemende pijn in de flank, moet je contact opnemen met je uroloog. In dat geval is het mogelijk dat je een bloeduitstorting op de nier hebt. Dat wordt best opgevolgd in het ziekenhuis.
- chunk_size: 500
- embedding_model: hkunlp/instructor-xl
- chunk_order: 3793
- chunk_overlap: 60
- chunk_id: 3793
- webpage_scrape_dt: None
- webpage_url: None
- webpage_title: None
- catalog_url: https://www.azstlucas.be/onderzoek-en-behandelingen/steenverbrijzeling-eswl
- path_context: (Node)<-NEXT_CHUNK-(Node)<-HAS_CHUNK-(Node)<-HAS_WEBPAGE-(Node)
* A
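
The second result above has `None` for the `webpage_*` fields, and its `catalog_url` holds a web page URL: the path contains an extra `NEXT_CHUNK` hop, so positions 2 and 4 no longer point at the WebPage and Catalog nodes. One possible way around this is to locate nodes by the relationship that precedes them instead of by position. The sketch below assumes, as in the output above, that a path is a list alternating node dicts and relationship-type strings, that the node after `HAS_CHUNK` is the WebPage, and that the node after `HAS_WEBPAGE` is the Catalog. The function name `extract_path_metadata` and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Union

# A path element is either a node (dict of properties) or a relationship type (str).
PathElement = Union[Dict[str, Any], str]


def extract_path_metadata(path: List[PathElement]) -> Dict[str, Any]:
    """Collect WebPage/Catalog metadata by relationship type, not by position."""
    meta: Dict[str, Any] = {}
    hops: List[str] = []

    # Walk over (relationship, following node) pairs along the path.
    for prev, node in zip(path, path[1:]):
        if not (isinstance(prev, str) and isinstance(node, dict)):
            continue
        hops.append(prev)
        if prev == "HAS_CHUNK":  # assumed: WebPage -[HAS_CHUNK]-> Chunk
            meta["webpage_scrape_dt"] = node.get("scrape_dt")
            meta["webpage_url"] = node.get("url")
            meta["webpage_title"] = node.get("title")
        elif prev == "HAS_WEBPAGE":  # assumed: Catalog -[HAS_WEBPAGE]-> WebPage
            meta["catalog_url"] = node.get("url")

    # Same textual representation as path_context in the output above.
    meta["path_context"] = "(Node)" + "".join(f"<-{h}-(Node)" for h in hops)
    return meta


# Made-up sample path with an extra NEXT_CHUNK hop (not real data).
sample_path: List[PathElement] = [
    {"text": "chunk b"},
    "NEXT_CHUNK",
    {"text": "chunk a"},
    "HAS_CHUNK",
    {"url": "https://example.org/page", "title": "Page", "scrape_dt": "2023-01-01"},
    "HAS_WEBPAGE",
    {"url": "https://example.org"},
]
print(extract_path_metadata(sample_path))
```

The returned dictionary could be merged into `doc_meta` inside `chunk_paths_to_docs`, which would keep the web page and catalog fields populated regardless of how many `NEXT_CHUNK` hops the path contains.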

In [21]:
from neo4j import GraphDatabase

search_query = "dokter"

# Use the driver as a context manager so the connection is closed afterwards
with GraphDatabase.driver(neo4j_url, auth=(neo4j_user, neo4j_password)) as driver:
    with driver.session(database="neo4j") as session:
        graph = session.run(
            f"""
                CALL db.index.fulltext.queryNodes("fts_Chunk_text", '{search_query}') 
                YIELD node, score
                WITH node, score
                ORDER BY score DESCENDING
                LIMIT 10
                OPTIONAL MATCH rel=(node)<-[r*]-(:Catalog)
                RETURN rel
            """
        ).graph()

graph

<neo4j.graph.Graph at 0x7fdc4ed08220>
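
The queries above interpolate the search text directly into the Cypher string, which breaks as soon as the text contains a quote. A possible alternative is to pass values as Cypher parameters. The sketch below only builds the query text and the parameter dictionary; the helper name `build_fulltext_query` is hypothetical, and the commented-out calls assume the `session` and `neo4j_graph` objects from the cells above.

```python
from typing import Any, Dict, Tuple


def build_fulltext_query(
    search_text: str,
    limit: int = 10,
    index_name: str = "fts_Chunk_text",
) -> Tuple[str, Dict[str, Any]]:
    """Return a parameterised Cypher query and its parameters."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")

    # $index_name, $search_text and $limit are filled in by the driver,
    # so the user-provided text never becomes part of the query string.
    query = """
        CALL db.index.fulltext.queryNodes($index_name, $search_text)
        YIELD node, score
        WITH node, score
        ORDER BY score DESCENDING
        LIMIT $limit
        OPTIONAL MATCH rel=(node)<-[r*]-(:Catalog)
        RETURN rel
    """
    params = {"index_name": index_name, "search_text": search_text, "limit": limit}
    return query, params


query, params = build_fulltext_query("dokter", limit=5)
print(params)

# With the neo4j driver session from the cell above:
#   graph = session.run(query, params).graph()
# With the LangChain store, assuming your version's query() accepts params:
#   paths = neo4j_graph.query(query, params=params)
```

Keeping the query text constant also lets Neo4j reuse the cached execution plan across different search terms.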