
fix: allow alternative vector db engine to be used #106

Merged: 32 commits, Jun 12, 2024

Changes from 11 commits

Commits (32)
f79631d
fix: allow alternative vector db engine to be used
borisarzentar Jun 6, 2024
00b60a9
added chunking config options
Vasilije1990 Jun 9, 2024
11231b7
rewrote chunking config
Vasilije1990 Jun 9, 2024
a23fc40
Updates to the configs
Vasilije1990 Jun 10, 2024
c9b2a06
rewrote configs
Vasilije1990 Jun 10, 2024
91864dc
fix to graph config
Vasilije1990 Jun 10, 2024
de972df
added topology to modules from csv, json
Vasilije1990 Jun 10, 2024
a0e9860
removed old topology
Vasilije1990 Jun 10, 2024
4f76c46
fixes
Vasilije1990 Jun 10, 2024
9fd542c
added topology refactor
Vasilije1990 Jun 10, 2024
d0939b9
added updates to topology
Vasilije1990 Jun 12, 2024
409d3c7
add updates
Vasilije1990 Jun 12, 2024
a197177
Updates to the configs
Vasilije1990 Jun 12, 2024
0d230c9
Add qdrant test
Vasilije1990 Jun 12, 2024
d5c7c66
Add qdrant test
Vasilije1990 Jun 12, 2024
b6a2a40
Add qdrant test
Vasilije1990 Jun 12, 2024
f5c0e27
Add qdrant test
Vasilije1990 Jun 12, 2024
7c66364
test: add weaviate integration test
borisarzentar Jun 12, 2024
636b548
Merge remote-tracking branch 'origin/fix/setting-alternative-vector-d…
borisarzentar Jun 12, 2024
0603fa8
test: add github action running weaviate integration test
borisarzentar Jun 12, 2024
3577be3
fix: change github actions names
borisarzentar Jun 12, 2024
e896fa3
Add NEO4J test
Vasilije1990 Jun 12, 2024
20d8bc3
Merge remote-tracking branch 'origin/fix/setting-alternative-vector-d…
Vasilije1990 Jun 12, 2024
e2db4d7
Add NEO4J test
Vasilije1990 Jun 12, 2024
ddb9914
Add NEO4J test
Vasilije1990 Jun 12, 2024
89f0d0a
Add NEO4J test
Vasilije1990 Jun 12, 2024
39b346d
Add NEO4J test
Vasilije1990 Jun 12, 2024
b68580c
Add NEO4J test
Vasilije1990 Jun 12, 2024
6a69279
fix: configure api client graph path
borisarzentar Jun 12, 2024
adedfa4
Merge remote-tracking branch 'origin/fix/setting-alternative-vector-d…
borisarzentar Jun 12, 2024
7466818
Add NEO4J test
Vasilije1990 Jun 12, 2024
e660410
chore: increase version to 0.1.12
borisarzentar Jun 12, 2024
4 changes: 4 additions & 0 deletions .gitignore
@@ -10,6 +10,10 @@ __pycache__/
*.py[cod]
*$py.class

notebooks/
full_run.ipynb
evals/

# C extensions
*.so

2 changes: 1 addition & 1 deletion cognee-frontend/src/app/page.tsx
@@ -112,7 +112,7 @@ export default function Home() {
expireIn={notification.expireIn}
onClose={notification.delete}
>
<Text>{notification.message}</Text>
<Text nowrap>{notification.message}</Text>
</Notification>
))}
</NotificationContainer>
27 changes: 16 additions & 11 deletions cognee/api/client.py
@@ -72,15 +72,21 @@ async def get_dataset_graph(dataset_id: str):
from cognee.infrastructure.databases.graph import get_graph_config
from cognee.infrastructure.databases.graph.get_graph_client import get_graph_client

graph_config = get_graph_config()
graph_engine = graph_config.graph_engine
graph_client = await get_graph_client(graph_engine)
graph_url = await render_graph(graph_client.graph)
try:
# graph_config = get_graph_config()
# graph_engine = graph_config.graph_engine
graph_client = await get_graph_client()
graph_url = await render_graph(graph_client.graph)

return JSONResponse(
status_code = 200,
content = str(graph_url),
)
return JSONResponse(
status_code = 200,
content = str(graph_url),
)
except:

Review comment (Contributor): Specify the exception type to improve error handling (see the hedged sketch after this file's diff).

-    except:
+    except SpecificExceptionType:

Committable suggestion was skipped due to low confidence.

Ruff: 85-85: Do not use bare except (E722)

return JSONResponse(
status_code = 409,
content = "Graphistry credentials are not set. Please set them in your .env file.",
)

@app.get("/datasets/{dataset_id}/data", response_model=list)
async def get_dataset_data(dataset_id: str):
@@ -106,7 +112,7 @@ async def get_dataset_status(datasets: Annotated[List[str], Query(alias="dataset

return JSONResponse(
status_code = 200,
content = { dataset["data_id"]: dataset["status"] for dataset in datasets_statuses },
content = datasets_statuses,
)

@app.get("/datasets/{dataset_id}/data/{data_id}/raw", response_class=FileResponse)
@@ -264,8 +270,7 @@ def start_api_server(host: str = "0.0.0.0", port: int = 8000):
relational_config.create_engine()

vector_config = get_vectordb_config()
vector_config.vector_db_path = databases_directory_path
vector_config.create_engine()
vector_config.vector_db_url = os.path.join(databases_directory_path, "cognee.lancedb")

base_config = get_base_config()
data_directory_path = os.path.abspath(".data_storage")
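Following up on the reviewer's bare-`except` comment above, the sketch below is one hedged way the handler in cognee/api/client.py could be narrowed. It is not the committed code: `GraphistryCredentialsError` is a hypothetical exception name standing in for whatever `render_graph` actually raises when Graphistry credentials are missing, and `graph_client` / `render_graph` are assumed to be the objects already used in that file.

```python
# Hedged sketch only -- narrows the bare `except` flagged by Ruff (E722).
# GraphistryCredentialsError is a hypothetical placeholder; confirm the real
# exception type raised by render_graph before adopting this.
from fastapi.responses import JSONResponse


class GraphistryCredentialsError(Exception):
    """Hypothetical: raised when Graphistry credentials are not configured."""


async def build_graph_response(graph_client, render_graph):
    try:
        graph_url = await render_graph(graph_client.graph)
        return JSONResponse(status_code = 200, content = str(graph_url))
    except GraphistryCredentialsError:
        # Only the credentials failure maps to 409; anything else propagates.
        return JSONResponse(
            status_code = 409,
            content = "Graphistry credentials are not set. Please set them in your .env file.",
        )
```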
157 changes: 91 additions & 66 deletions cognee/api/v1/cognify/cognify.py
@@ -5,7 +5,11 @@
import nltk
from asyncio import Lock
from nltk.corpus import stopwords

from cognee.infrastructure.data.chunking.LangchainChunkingEngine import LangchainChunkEngine
from cognee.infrastructure.data.chunking.get_chunking_engine import get_chunk_engine
from cognee.infrastructure.databases.graph.config import get_graph_config
from cognee.infrastructure.databases.vector.embeddings.LiteLLMEmbeddingEngine import LiteLLMEmbeddingEngine
from cognee.modules.cognify.graph.add_node_connections import group_nodes_by_layer, \
graph_ready_output, connect_nodes_in_graph
from cognee.modules.cognify.graph.add_data_chunks import add_data_chunks, add_data_chunks_basic_rag
@@ -23,7 +27,7 @@
from cognee.modules.data.get_content_summary import get_content_summary
from cognee.modules.data.get_cognitive_layers import get_cognitive_layers
from cognee.modules.data.get_layer_graphs import get_layer_graphs
from cognee.shared.data_models import KnowledgeGraph
from cognee.shared.data_models import KnowledgeGraph, ChunkStrategy, ChunkEngine
from cognee.shared.utils import send_telemetry
from cognee.modules.tasks import create_task_status_table, update_task_status
from cognee.shared.SourceCodeGraph import SourceCodeGraph
@@ -45,9 +49,9 @@ async def cognify(datasets: Union[str, List[str]] = None):
stopwords.ensure_loaded()
create_task_status_table()

graph_config = get_graph_config()
graph_db_type = graph_config.graph_engine
graph_client = await get_graph_client(graph_db_type)
# graph_config = get_graph_config()
# graph_db_type = graph_config.graph_engine
graph_client = await get_graph_client()

relational_config = get_relationaldb_config()
db_engine = relational_config.database_engine
@@ -61,14 +65,19 @@ async def handle_cognify_task(dataset_name: str):
async with update_status_lock:
task_status = get_task_status([dataset_name])

if task_status == "DATASET_PROCESSING_STARTED":
if dataset_name in task_status and task_status[dataset_name] == "DATASET_PROCESSING_STARTED":
logger.info(f"Dataset {dataset_name} is being processed.")
return

update_task_status(dataset_name, "DATASET_PROCESSING_STARTED")

await cognify(dataset_name)
update_task_status(dataset_name, "DATASET_PROCESSING_FINISHED")
try:
await cognify(dataset_name)
update_task_status(dataset_name, "DATASET_PROCESSING_FINISHED")
except Exception as error:
update_task_status(dataset_name, "DATASET_PROCESSING_ERROR")
raise error


# datasets is a list of dataset names
if isinstance(datasets, list):
@@ -89,7 +98,7 @@ async def handle_cognify_task(dataset_name: str):
dataset_files.append((added_dataset, db_engine.get_files_metadata(added_dataset)))

chunk_config = get_chunk_config()
chunk_engine = chunk_config.chunk_engine
chunk_engine = get_chunk_engine()
chunk_strategy = chunk_config.chunk_strategy

async def process_batch(files_batch):
@@ -104,7 +113,7 @@ async def process_batch(files_batch):
text = "empty file"
if text == "":
text = "empty file"
subchunks = chunk_engine.chunk_data(chunk_strategy, text, chunk_config.chunk_size, chunk_config.chunk_overlap)
subchunks,_ = chunk_engine.chunk_data(chunk_strategy, text, chunk_config.chunk_size, chunk_config.chunk_overlap)

if dataset_name not in data_chunks:
data_chunks[dataset_name] = []
@@ -136,21 +145,37 @@ async def process_batch(files_batch):
batch_size = 20
file_count = 0
files_batch = []
from cognee.infrastructure.databases.graph.config import get_graph_config
graph_config = get_graph_config()
graph_topology = graph_config.graph_model

Review comment (Contributor): The variable graph_topology is declared but not used, which could lead to unnecessary memory usage.

Suggested change (review carefully before committing):
- graph_topology = graph_config.graph_model

Ruff: 150-150: Local variable graph_topology is assigned to but never used (F841). Remove assignment to unused variable graph_topology.

if graph_config.infer_graph_topology and graph_config.graph_topology_task:
from cognee.modules.topology.topology import TopologyEngine
topology_engine = TopologyEngine(infer=graph_config.infer_graph_topology)
await topology_engine.add_graph_topology(dataset_files=dataset_files)
elif not graph_config.infer_graph_topology:
from cognee.modules.topology.topology import TopologyEngine
topology_engine = TopologyEngine(infer=graph_config.infer_graph_topology)
await topology_engine.add_graph_topology(graph_config.topology_file_path)
elif not graph_config.graph_topology_task:
parent_node_id = f"DefaultGraphModel__{USER_ID}"


for (dataset_name, files) in dataset_files:
for file_metadata in files:
graph_topology = graph_config.graph_model

if graph_topology == SourceCodeGraph:
parent_node_id = f"{file_metadata['name']}.{file_metadata['extension']}"
if parent_node_id:
document_id = await add_document_node(
graph_client,
parent_node_id = parent_node_id,
document_metadata = file_metadata,
)
else:
parent_node_id = f"DefaultGraphModel__{USER_ID}"

document_id = await add_document_node(
graph_client,
parent_node_id=parent_node_id,
document_metadata=file_metadata,
)
document_id = await add_document_node(
graph_client,
parent_node_id=file_metadata['id'],
document_metadata=file_metadata,
)

files_batch.append((dataset_name, file_metadata, document_id))
file_count += 1
@@ -171,7 +196,7 @@ async def process_text(chunk_collection: str, chunk_id: str, input_text: str, fi
print(f"Processing chunk ({chunk_id}) from document ({file_metadata['id']}).")

graph_config = get_graph_config()
graph_client = await get_graph_client(graph_config.graph_engine)
graph_client = await get_graph_client()
graph_topology = graph_config.graph_model

if graph_topology == SourceCodeGraph:
@@ -240,52 +265,52 @@ async def process_text(chunk_collection: str, chunk_id: str, input_text: str, fi



# if __name__ == "__main__":
if __name__ == "__main__":

# async def test():
# # await prune.prune_system()
# # #
# # from cognee.api.v1.add import add
# # data_directory_path = os.path.abspath("../../../.data")
# # # print(data_directory_path)
# # # config.data_root_directory(data_directory_path)
# # # cognee_directory_path = os.path.abspath("../.cognee_system")
# # # config.system_root_directory(cognee_directory_path)
# #
# # await add("data://" +data_directory_path, "example")
async def test():
# await prune.prune_system()
# #
# from cognee.api.v1.add import add
# data_directory_path = os.path.abspath("../../../.data")
# # print(data_directory_path)
# # config.data_root_directory(data_directory_path)
# # cognee_directory_path = os.path.abspath("../.cognee_system")
# # config.system_root_directory(cognee_directory_path)
#
# await add("data://" +data_directory_path, "example")

# text = """import subprocess
# def show_all_processes():
# process = subprocess.Popen(['ps', 'aux'], stdout=subprocess.PIPE)
# output, error = process.communicate()
text = """Conservative PP in the lead in Spain, according to estimate
An estimate has been published for Spain:

# if error:
# print(f"Error: {error}")
# else:
# print(output.decode())
Opposition leader Alberto Núñez Feijóo’s conservative People’s party (PP): 32.4%

# show_all_processes()"""

# from cognee.api.v1.add import add

# await add([text], "example_dataset")

# infrastructure_config.set_config( {"chunk_engine": LangchainChunkEngine() , "chunk_strategy": ChunkStrategy.CODE,'embedding_engine': LiteLLMEmbeddingEngine() })
# from cognee.shared.SourceCodeGraph import SourceCodeGraph
# from cognee.api.v1.config import config

# # config.set_graph_model(SourceCodeGraph)
# # config.set_classification_model(CodeContentPrediction)
# # graph = await cognify()
# vector_client = infrastructure_config.get_config("vector_engine")

# out = await vector_client.search(collection_name ="basic_rag", query_text="show_all_processes", limit=10)

# print("results", out)
# #
# # from cognee.shared.utils import render_graph
# #
# # await render_graph(graph, include_color=True, include_nodes=False, include_size=False)

# import asyncio
# asyncio.run(test())
Spanish prime minister Pedro Sánchez’s Socialist party (PSOE): 30.2%

The far-right Vox party: 10.4%

In Spain, the right has sought to turn the European election into a referendum on Sánchez.

Ahead of the vote, public attention has focused on a saga embroiling the prime minister’s wife, Begoña Gómez, who is being investigated over allegations of corruption and influence-peddling, which Sanchez has dismissed as politically-motivated and totally baseless."""

from cognee.api.v1.add import add

await add([text], "example_dataset")

from cognee.api.v1.config.config import config
config.set_chunk_engine(ChunkEngine.LANGCHAIN_ENGINE )
config.set_chunk_strategy(ChunkStrategy.LANGCHAIN_CHARACTER)
config.embedding_engine = LiteLLMEmbeddingEngine()

graph = await cognify()

Review comment (Contributor): The variable graph is assigned but never used. Consider removing it if it's not needed.

Suggested change (review carefully before committing):
- graph = await cognify()
+ await cognify()

Ruff: 304-304: Local variable graph is assigned to but never used (F841). Remove assignment to unused variable graph.

# vector_client = infrastructure_config.get_config("vector_engine")
#
# out = await vector_client.search(collection_name ="basic_rag", query_text="show_all_processes", limit=10)
#
# print("results", out)
#
# from cognee.shared.utils import render_graph
#
# await render_graph(graph, include_color=True, include_nodes=False, include_size=False)

import asyncio
asyncio.run(test())
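For readers skimming the diff, here is a minimal sketch of the chunking call pattern introduced above in cognify.py. The `get_chunk_engine` import matches the one added in this PR; the import path for `get_chunk_config` and the exact return shape of `chunk_data` are assumptions inferred from how they are used in the hunks above.

```python
# Sketch of the chunking flow from process_batch(), under the assumptions
# stated above; not a drop-in replacement for cognify().
from cognee.infrastructure.data.chunking.get_chunking_engine import get_chunk_engine
# Assumed import path for the chunk config accessor used in the diff:
from cognee.infrastructure.data.chunking.config import get_chunk_config


def chunk_text(text: str):
    chunk_config = get_chunk_config()   # carries chunk_strategy, chunk_size, chunk_overlap
    chunk_engine = get_chunk_engine()   # engine chosen via config.set_chunk_engine(...)

    if not text:
        text = "empty file"             # mirrors the empty-file guard in process_batch

    # chunk_data returns a (chunks, metadata) pair in the diff; only chunks are kept.
    subchunks, _ = chunk_engine.chunk_data(
        chunk_config.chunk_strategy,
        text,
        chunk_config.chunk_size,
        chunk_config.chunk_overlap,
    )
    return subchunks
```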
20 changes: 18 additions & 2 deletions cognee/api/v1/config/config.py
@@ -17,8 +17,8 @@ def system_root_directory(system_root_directory: str):
relational_config.create_engine()

vector_config = get_vectordb_config()
vector_config.vector_db_path = databases_directory_path
vector_config.create_engine()
if vector_config.vector_engine_provider == "lancedb":
vector_config.vector_db_url = os.path.join(databases_directory_path, "cognee.lancedb")

@staticmethod
def data_root_directory(data_root_directory: str):
@@ -89,3 +89,19 @@ def connect_documents(connect_documents: bool):
def set_chunk_strategy(chunk_strategy: object):
chunk_config = get_chunk_config()
chunk_config.chunk_strategy = chunk_strategy

@staticmethod
def set_chunk_engine(chunk_engine: object):
chunk_config = get_chunk_config()
chunk_config.chunk_engine = chunk_engine

@staticmethod
def set_chunk_overlap(chunk_overlap: object):
chunk_config = get_chunk_config()
chunk_config.chunk_overlap = chunk_overlap

@staticmethod
def set_chunk_size(chunk_size: object):
chunk_config = get_chunk_config()
chunk_config.chunk_size = chunk_size
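
A short usage sketch of the chunking setters added to config.py above, mirroring the `__main__` example in cognify.py; the `chunk_size` and `chunk_overlap` values are illustrative only, not defaults taken from the codebase.

```python
# Illustrative usage of the new config setters; values below are examples only.
from cognee.api.v1.config.config import config
from cognee.shared.data_models import ChunkEngine, ChunkStrategy

config.set_chunk_engine(ChunkEngine.LANGCHAIN_ENGINE)
config.set_chunk_strategy(ChunkStrategy.LANGCHAIN_CHARACTER)
config.set_chunk_size(1024)     # example value
config.set_chunk_overlap(10)    # example value
```

For a non-default vector engine, the hunk above suggests the switch happens via `vector_config.vector_engine_provider` (only the "lancedb" branch sets `vector_db_url`), but the exact setter for the provider is not shown in this diff.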

Empty file removed cognee/api/v1/topology/__init__.py
Empty file.