# Searchflow

### Packages needed to run this notebook
- libmagic: ```brew install libmagic```

### ChromaDB
Don't forget to start the ChromaDB server: ```docker run -p 7777:8000 chromadb/chroma```

### Run the API
See the uv guide on FastAPI integration: https://docs.astral.sh/uv/guides/integration/fastapi/

For development purposes, you can run the API with: ```uv run fastapi dev```

In [1]:
# Load packages from the src directory
import sys
import json
sys.path.append('../src')
from vectrix_graphs import ExtractDocuments, setup_logger, ExtractMetaData


from dotenv import load_dotenv
load_dotenv()



True

### Extracting chunks of data from a document

In [2]:
# Create chunks of data from a document
extract = ExtractDocuments(
    logger=setup_logger(name="Files", level="INFO"),
    )

result = extract.extract(file_path="./files/arabic_test.docx")


2024-10-27 12:25:26,522 - Files - INFO - Extracting documents from ./files/arabic_test.docx


In [3]:
print('Metadata:')
print(json.dumps(result[1].metadata, indent=4))

print('Content:')
print(result[1].page_content)

Metadata:
{
    "file_directory": "./files",
    "filename": "arabic_test.docx",
    "last_modified": "2024-10-25T10:09:47",
    "orig_elements": "eJzdWNuO2zYQ/RXBTwlQbnm/+A8KFH3at6YwhrfGwHo38KpFgqD/Xsmk4ZFpZbVAmtr7IhDkkBwenjPD0e9fV+kh7dJjv9nG1bpbsSSFoM4S0C4TabQhXolIKCSVIjMhZr36qVvtUg8RehjmfF0F6NOfT/svm5g+9R+HLjpY5O1D2sTtPoV+GBrXvvt57Hte1cFH2KWxG/bgt2HTp+f+Lj6Fz+P4Azz3m91T3OZtOvjFKZeEUcLVPaNr6tbSrP4ZDPv0uR/HP/xFNTfjV8rDVx164uELhx57+MLJksO4Vf/l08GN+23/kMY1zyGxSitrpSTUCU9kFIkAV5F4H5mXhgaqxE1AosevMKifdydksGlBrJoi3CTHiP26fe5/6dPuEmh04AoFHgkIx4j0ihIbhCbMe7DWspzS9YJWuBMOGDDEF9yGDgHlEMqAkC3DPCN8C5qyayAXp2G86XGlltvmfFPu5kwzWjUU06UXmRSNiocxAgQgUmdPwFlOvPWghXLZRvfyRbL/5yIFhhxh49c3fnsdUqbGDO2aExVbgZ21MzuJyX6Lpa4Ss5FxkkBrIukQH32gQJi2PBqtk2PsahniERL6PEG8EZ5UxwtPMuoqM6TqTuNClkUWMaRDAxgce+5/zcDmFYdczD7mqAieE5PjkJ2lysTG5IZk7ZVykhsO/ibYJ5ueN8u+eiJ9mi01GpD8tFTZdUo1nIZLkm6dKi8/IZApeuXU7F1XKqhAw+ZiW4xEdbAiiPdTyHMUi8X0+GWGQp6gG+Ot3qa+LVWD8NnRwCTRTKghFjNPbA6SJJcoU8PTS19vtp7cqz2/hrerBhxTj9505zo4crE9RTin7axrE

In [4]:
# Add additional metadata using a NER pipeline
ner = ExtractMetaData(
    logger=setup_logger(name="ExtractMetaData", level="INFO"),
    model="gpt-4o-mini",  # Options: gpt-4o-mini, llama3.1-8B, llama3.1-70B
    )
result_with_metadata = ner.extract(result, source="uploaded_file")

2024-10-27 12:25:42,232 - ExtractMetaData - INFO - Extracting metadata from 9 documents, using gpt-4o-mini


In [5]:
result_with_metadata[0].metadata

{'filename': 'arabic_test.docx',
 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
 'author': 'هيئة السوق المالية',
 'source': 'uploaded_file',
 'word_count': 41,
 'language': 'AR',
 'content_type': 'other',
 'tags': "['Finance', 'Stock Market', 'Saudi Arabia']",
 'summary': 'The document outlines the procedures for suspending the trading of listed securities according to the listing rules approved by the Capital Market Authority in Saudi Arabia.',
 'read_time': 0.205,
 'last_modified': '2024-10-25T10:09:47'}
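
Note that `tags` comes back as the string form of a list rather than as an actual list. A minimal sketch for turning it back into a list, assuming the string is always a valid Python list literal as in the output above (`parse_tags` is a hypothetical helper, not part of `vectrix_graphs`):

```python
import ast


def parse_tags(raw: str) -> list[str]:
    """Parse a stringified list such as "['a', 'b']" into a real list."""
    tags = ast.literal_eval(raw)  # evaluates literals only, never arbitrary code
    if not isinstance(tags, list):
        raise ValueError(f"expected a list literal, got {type(tags).__name__}")
    return [str(tag) for tag in tags]


# Example value copied from the metadata shown above
metadata = {"tags": "['Finance', 'Stock Market', 'Saudi Arabia']"}
print(parse_tags(metadata["tags"]))
```

`ast.literal_eval` is used instead of `eval` so that a malformed or malicious string cannot execute code.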

### Adding the documents to a vector database (Chroma)

For this demo, the vector database is saved locally on disk; restarting the container will delete the database.
I prefer cosine distance over the default squared L2 distance. This is set through the `hnsw:space` metadata.

$$
d = 1.0 - \frac{\sum(A_i \times B_i)}{\sqrt{\sum(A_i^2) \cdot \sum(B_i^2)}}
$$
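
As an illustration of the formula above, here is a small pure-Python sketch. It is not the code Chroma runs internally, and `cosine_distance` is a name chosen for this example only.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity, following the formula above."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_product = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm_product == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm_product


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The result always lies between 0 (same direction) and 2 (opposite direction), and it does not depend on the length of either vector.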

We use Ollama to calculate the embeddings locally with BGE-M3. Because it supports more than 100 languages, it is well suited to embedding Arabic documents.

BGE-M3 is based on the XLM-RoBERTa architecture and is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity:

- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: It can support more than 100 working languages.
- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.

> ℹ️ So all embeddings will be calculated locally ℹ️
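
To make the local embedding step concrete, here is a rough sketch of requesting embeddings from Ollama over its REST API using only the standard library. It assumes Ollama is listening on its default address `http://localhost:11434` and that the model was pulled under the tag `bge-m3`. The helper names are hypothetical, and the `vectordb` wrapper used in this notebook may call Ollama differently.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address
MODEL = "bge-m3"                       # assumption: tag of the locally pulled model


def build_embed_request(texts: list[str], model: str = MODEL) -> urllib.request.Request:
    """Build the POST request for Ollama's /api/embed endpoint."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts: list[str]) -> list[list[float]]:
    """Send the request; requires a running Ollama server with the model pulled."""
    with urllib.request.urlopen(build_embed_request(texts)) as resp:
        return json.loads(resp.read())["embeddings"]
```

Calling `embed(["some text"])` should return one vector per input string. Nothing leaves the machine, since the request goes to localhost.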


In [6]:
from vectrix_graphs import vectordb

vectordb.remove_collection("demo")
vectordb.create_collection("demo")
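
For reference, the `hnsw:space` setting can also be passed when creating a collection directly with the `chromadb` client. This is a sketch under assumptions: the server is the one started with `docker run -p 7777:8000`, and both helper names are hypothetical. Whether `vectordb.create_collection` does the same thing internally is not shown in this notebook.

```python
def cosine_collection_kwargs(name: str) -> dict:
    """Arguments that ask Chroma to index a collection with cosine distance."""
    return {"name": name, "metadata": {"hnsw:space": "cosine"}}


def create_cosine_collection(name: str, host: str = "localhost", port: int = 7777):
    """Create (or fetch) the collection on the Chroma server started earlier."""
    import chromadb  # imported lazily so the helper above works without it

    client = chromadb.HttpClient(host=host, port=port)
    return client.get_or_create_collection(**cosine_collection_kwargs(name))
```

The distance metric is fixed when the collection is created, so changing it means removing and recreating the collection.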

In [7]:
vectordb.add_documents(result_with_metadata)

In [8]:
# Now let's query the vector database
vectordb.similarity_search(
    query="What is the attention mechanism?",
    k=3
    )

[Document(metadata={'author': '', 'content_type': 'other', 'filename': 'arabic_test.docx', 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'language': 'AR', 'last_modified': '2024-10-25T10:09:47', 'read_time': 0.71, 'source': 'uploaded_file', 'summary': 'يتناول المحتوى إجراءات تعليق تداول الأوراق المالية المدرجة في حال عدم التزام المصدر بنشر معلوماته المالية خلال المدة المحددة.', 'tags': "['الأسواق المالية', 'تداول الأوراق المالية', 'الإفصاح المالي']", 'word_count': 142, 'cosine_distance': 938.0967500597377, 'uuid': '71263779-03da-4b05-8387-ce236519e49d'}, page_content='في حال عدم تمكن المصدر من الإعلان عن الحدث خلال المدة التي حددها فيجب عليه الإعلان عن سبب ذلك قبل انتهاء تلك المدة.\n\nترفع السوق التعليق مباشرة فور انتهاء المدة التي حددها المصدر في طلب التعليق ما لم ترَ الهيئة أو السوق خلاف ذلك.\n\nثانياً: تعليق تداول الأوراق المالية المدرجة عند عدم نشر المصدر معلوماته المالية (الأولية أو السنوية)\nوفقاً للفقرة الفرعية (1) من الفقرة (ج) من المادة

## Asking questions to the Graph
Let's now ask questions using the LangGraph workflow.

### Example 1: Using closed-source LLMs


### Example 2: Using open-source LLMs that can be self-hosted

In [None]:
# Load packages from the src directory
import sys
from IPython.display import Markdown, display, Image
sys.path.append('../src')

from langchain_core.messages import HumanMessage
from vectrix_graphs import local_slm_demo

# Display the graph
display(Image(local_slm_demo.get_graph().draw_mermaid_png()))

# Ask the question
messages = [HumanMessage(content="What is the attention mechanism?")]


# Run the graph (top-level await works in Jupyter/IPython)
response = await local_slm_demo.ainvoke({"messages": messages})
display(Markdown(f"***Question:*** \n {messages[0].content}\n"))
display(Markdown(response['messages'][-1].content))

In [None]:
# Load packages from the src directory
import sys
import json
from IPython.display import Markdown, display, Image
sys.path.append('../src')

from langchain_core.messages import HumanMessage
from vectrix_graphs import local_slm_demo

# Ask the question
messages = [HumanMessage(content="What is the attention mechanism?")]

response = await local_slm_demo.ainvoke({"messages": messages})
# Message objects are not JSON serializable, so fall back to str() for them
print(json.dumps(response, indent=4, default=str))

In [None]:
print(response['messages'][-1])

In [None]:
from langchain_core.messages import HumanMessage, AIMessage
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate


human = HumanMessage(content="What is the attention mechanism?")
ai = AIMessage(content="The attention mechanism is a technique used in neural networks to enable the model to focus on relevant parts of the input data during processing.")

messages = ChatPromptTemplate.from_messages([human, ai])


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chain = messages | llm | StrOutputParser()

chain.invoke({})