
Closed - Design docs for "Construction of Enriched Research Knowledge Graphs and an Interactive User Interface for Project Grants and Research Findings" #65

Closed
tekrajchhetri wants to merge 12 commits into main from design_docs

Conversation

@tekrajchhetri
Collaborator

This PR contains the high-level design docs for "Construction of Enriched Research Knowledge Graphs and an Interactive User Interface for Project Grants and Research Findings"

@gemini-code-assist
Contributor

Summary of Changes

Hello @tekrajchhetri, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive design document for a new research knowledge graph system, outlining its architecture, use cases, and requirements. Concurrently, it delivers substantial enhancements to the underlying infrastructure, focusing on improving the robustness, scalability, and user experience of the platform. Key areas of improvement include database connection management, asynchronous processing for CPU-bound tasks, a sophisticated job tracking and recovery system for data ingestion, and better handling of large files. These changes collectively aim to build a more resilient and performant foundation for the knowledge graph project.

Highlights

  • Comprehensive Design Document for Research Knowledge Graphs: A new design document, 'PI-Grant-Skills-Research-Design.md', has been added. It outlines the vision, use cases, goals, and detailed requirements for constructing enriched research knowledge graphs and an interactive user interface for project grants and research findings. This document serves as a foundational reference for the project's evolving implementation.
  • Enhanced Database Connection Pooling and Robustness: Significant improvements have been made to PostgreSQL connection pooling in both ml_service and query_service. This includes dynamic calculation of pool sizes based on worker count, increased connection acquisition timeouts (from 5s to 30s), and detailed logging of pool status. These changes aim to prevent connection exhaustion and improve stability under high concurrency.
  • Asynchronous Password Hashing and Verification: Password hashing and verification operations (get_password_hash, verify_password) have been converted to asynchronous functions using asyncio.to_thread. This offloads CPU-bound tasks to a separate thread pool, preventing them from blocking the main event loop and improving overall API responsiveness.
  • Atomic User Registration and Login Optimization: User registration (/register) now handles unique email violations atomically at the database level, eliminating race conditions. The login (/token) endpoint has been optimized to reuse a single database connection for both authentication and scope retrieval, reducing connection overhead.
  • Advanced Job Processing, Tracking, and Recovery System: The ingestion job system in query_service has been significantly upgraded. This includes new database columns and a job_processing_log table for detailed real-time status updates, current file/stage tracking, and comprehensive processing history. New API endpoints for checking job recoverability and manually triggering recovery of 'stuck' jobs have been introduced, alongside an automatic startup recovery mechanism.
  • Optimized Large File Ingestion with Provenance: The file ingestion process now includes optimizations for large files, such as chunked reading during upload to prevent memory spikes and a 'skip_provenance' option for faster ingestion. A lightweight provenance attachment mechanism has been added for large files, focusing on graph-level metadata for performance.
  • WebSocket Authentication Support: A new authenticate_websocket function has been added to query_service/core/security.py, enabling JWT-based authentication for WebSocket connections, aligning security practices with HTTP endpoints.
  • Infrastructure Setup and Volume Management Improvements: The start_services.sh script now includes a validate_oxigraph_volumes function to automatically check, create, and set permissions for Oxigraph data and temporary bind-mount paths. A safety warning and confirmation prompt have been added for docker-compose down -v to prevent accidental data loss.
  • New Load Testing Utility: A new asynchronous load testing script (load_test_token_endpoint.py) has been added to the ml_service for performance testing the JWT token endpoint, complete with latency metrics, status counts, and plot generation.
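The asynchronous password-hashing change listed above follows the standard asyncio.to_thread offloading pattern. A minimal sketch is shown below; the passlib CryptContext/bcrypt backend is an assumption, since this summary does not name the hashing library used by ml_service and query_service.

import asyncio

from passlib.context import CryptContext  # assumed backend; not confirmed by this PR

pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")

async def get_password_hash(password: str) -> str:
    # Run the CPU-bound hash in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(pwd_context.hash, password)

async def verify_password(plain_password: str, hashed_password: str) -> bool:
    # Verification is equally CPU-bound and is offloaded the same way.
    return await asyncio.to_thread(pwd_context.verify, plain_password, hashed_password)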


Changelog
  • docker-compose.unified.yml
    • Added command to the postgres service to configure max_connections=200 and shared_buffers=256MB for improved performance.
    • Added comments for oxigraph service volumes, clarifying usage of OXIGRAPH_DATA_PATH and OXIGRAPH_TMP_PATH for bind mounts and emphasizing permission requirements.
  • docs/design_docs/PI-Grant-Skills-Research-Design.md
    • New file added, detailing the design for 'Construction of Enriched Research Knowledge Graphs and an Interactive User Interface for Project Grants and Research Findings'.
    • Includes sections on Overview, Use Cases (e.g., understanding funding landscape, discovering collaborators, assessing impact), Goals (e.g., construct enriched KGs, build interactive UI, integrate with existing efforts), Requirements (Knowledge Graph, User Interface, Backend/Platform), and Tasks (UI Design, Ontology/Schema Modeling, Backend Services).
  • ml_service/core/database.py
    • Imported status from fastapi for more explicit HTTP status codes.
    • Revised PostgreSQL connection pool size calculation logic to support higher concurrency, defaulting to 25 connections per worker in production and allowing environment override via DB_POOL_MAX_SIZE.
    • Added specific exception handling for asyncpg.exceptions.UniqueViolationError during user insertion, returning a 400 BAD_REQUEST with a user-friendly message for duplicate emails.
    • Modified get_scopes_by_user to optionally accept an existing database connection, enabling better connection management and reuse in calling functions.
  • ml_service/core/routers/jwt_auth.py
    • Updated the /register endpoint to use get_db_connection as a context manager and removed the explicit get_user pre-check, relying on the database's unique constraint and insert_data's new exception handling for duplicate emails.
    • Updated the /token (login) endpoint to use get_db_connection as a context manager and reuse the same connection for both user authentication and scope retrieval, minimizing connection pool usage.
  • ml_service/core/security.py
    • Imported asyncio for asynchronous operations.
    • Converted get_password_hash and verify_password functions to async using asyncio.to_thread to prevent blocking the event loop during CPU-bound password operations.
  • ml_service/test/load_test_token_endpoint.py
    • New file added, providing a comprehensive asynchronous load testing script for the /api/token endpoint.
    • Features include concurrent request handling, latency measurement, success/failure tracking, JSONL logging of individual requests, summary statistics, and generation of various plots (latency over time, histogram, ECDF, status counts).
  • query_service/core/database.py
    • Simplified DB_SETTINGS initialization by directly loading environment variables.
    • Replaced sys.stdout.write with print for standard logging output.
    • Enhanced get_db_connection with detailed logging of connection pool status (size, idle, in_use) before and after acquiring/releasing connections.
    • Increased the connection acquisition timeout in get_db_connection from 5 seconds to 30 seconds to better handle temporary bursts of requests.
    • Added specific exception handling for asyncpg.exceptions.UniqueViolationError during user insertion, returning a 400 BAD_REQUEST for duplicate emails.
    • Modified get_scopes_by_user to optionally accept an existing database connection.
    • Introduced update_job_processing_state to update a job's current_file, current_stage, and status_message for real-time progress tracking.
    • Added insert_processing_log and get_processing_log functions to store and retrieve a detailed history of job processing stages and messages.
    • Extended get_job_by_id_and_user and list_user_jobs to fetch and return new job-related fields such as current_file, current_stage, status_message, unrecoverable, unrecoverable_reason, progress_percent, elapsed_seconds, and can_recover.
  • query_service/core/main.py
    • Modified the jobs table creation schema to include new columns: current_file, current_stage, status_message, unrecoverable (BOOLEAN), and unrecoverable_reason (TEXT).
    • Added ALTER TABLE statements to safely add these new columns to existing jobs tables, ensuring backward compatibility.
    • Created a new job_processing_log table with columns for job_id, file_name, stage, status_message, timestamp, file_index, and total_files, along with indexes for efficient querying.
    • Implemented a startup event (startup_event) to automatically call recover_stuck_jobs (from core.routers.insert) to identify and mark as 'error' any jobs that were left in a 'running' state from a previous server session (e.g., due to crashes or restarts).
  • query_service/core/routers/insert.py
    • Introduced _running_job_tasks global dictionary to track active background ingestion tasks, mapping job_id to asyncio.Task objects.
    • Modified process_file_with_provenance to accept a skip_provenance boolean flag, allowing for faster ingestion by bypassing provenance attachment.
    • Added _process_large_file_with_lightweight_provenance function to efficiently append minimal provenance metadata to very large files without loading the entire graph into memory.
    • Enhanced upload_single_file_path to integrate with the new job tracking system, updating current_file, current_stage, and status_message in the database, and logging detailed processing steps.
    • Optimized upload_single_file_path for large files by implementing chunked reading for httpx.AsyncClient content, preventing memory spikes.
    • Refactored run_ingest_job to include a MAX_JOB_TIMEOUT (2 hours) for the entire job execution, ensuring jobs don't run indefinitely.
    • Implemented batch database updates (_batch_update_job_results) for job results and progress, significantly reducing database connection overhead during ingestion.
    • Added robust error handling and a finally block in run_ingest_job to ensure job status is always updated, even in case of unexpected crashes.
    • Introduced check_job_recoverable endpoint (GET /insert/jobs/check-recoverable) to perform a lightweight check on whether a job is eligible for recovery, considering its status, age, and whether its process is still active.
    • Added recover_stuck_jobs_endpoint (POST /insert/jobs/recover) to manually trigger the recovery of stuck jobs. This endpoint first checks recoverability and marks jobs as 'unrecoverable' if they don't meet criteria or if their process is still running.
    • Enhanced get_job_detail to provide a more comprehensive view of job status, including an 'effective_status' (done, partial, failed), elapsed time, estimated remaining time, detailed processing history, and a list of failed files.
  • query_service/core/security.py
    • Imported asyncio and WebSocket.
    • Converted get_password_hash and verify_password functions to async using asyncio.to_thread to prevent blocking the event loop.
    • Added authenticate_websocket function to handle JWT authentication for WebSocket connections, supporting tokens from both Authorization headers and query parameters.
  • query_service/core/shared.py
    • Refined the attach_provenance function to simplify provenance attachment. It now focuses on adding graph-level provenance metadata (ingestion activity, user, timestamp) directly to the graph, rather than attempting to link provenance to individual entities, which improves performance for large graphs.
  • readme.md
    • Removed the 'Troubleshooting' section, likely due to the new automated volume validation in start_services.sh addressing common issues.
  • start_services.sh
    • Added a new shell function validate_oxigraph_volumes to check if OXIGRAPH_DATA_PATH and OXIGRAPH_TMP_PATH (for bind mounts) exist and are writable, creating them and setting permissions if necessary.
    • Integrated the validate_oxigraph_volumes function call before starting Docker services (up or start commands).
    • Added a safety warning and confirmation prompt when docker-compose down -v (or --volumes) is detected, to prevent accidental deletion of persistent data volumes (Oxigraph, PostgreSQL, pgAdmin).
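As a rough sketch of the duplicate-email handling described in the changelog above (relying on the database's unique constraint rather than a pre-check), the pattern looks like the following; the users table, its column names, and the exact insert_data signature are illustrative assumptions, not the project's actual schema.

import asyncpg
from fastapi import HTTPException, status

async def insert_data(conn: asyncpg.Connection, email: str, hashed_password: str) -> None:
    # Let the UNIQUE constraint catch duplicates atomically; no separate lookup
    # is needed, so concurrent registrations cannot race past a pre-check.
    try:
        await conn.execute(
            "INSERT INTO users (email, hashed_password) VALUES ($1, $2)",
            email,
            hashed_password,
        )
    except asyncpg.exceptions.UniqueViolationError:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="A user with this email is already registered.",
        )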
Activity
  • A new design document was created to outline the project's vision and technical requirements.
  • Database connection pooling logic was refactored and optimized for better performance and stability.
  • CPU-intensive operations like password hashing were made asynchronous to improve API responsiveness.
  • The job ingestion and processing system was significantly enhanced with real-time tracking, detailed logging, and robust recovery mechanisms.
  • New API endpoints were developed to allow users to monitor and recover their ingestion jobs.
  • Infrastructure scripts were updated to improve setup reliability and prevent accidental data loss.
  • A new load testing utility was added to ensure the performance of critical authentication endpoints.
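The connection-pool changes summarized above (per-worker sizing with a DB_POOL_MAX_SIZE override and a 30-second acquisition timeout) could be wired up roughly as in the sketch below; apart from DB_POOL_MAX_SIZE and the 30-second timeout, the names and defaults here are assumptions based on the changelog description.

import os
from contextlib import asynccontextmanager

import asyncpg

# Assumed default of 25 connections per worker process, overridable via DB_POOL_MAX_SIZE.
POOL_MAX_SIZE = int(os.getenv("DB_POOL_MAX_SIZE", "25"))

async def create_pool(dsn: str) -> asyncpg.Pool:
    return await asyncpg.create_pool(dsn=dsn, min_size=1, max_size=POOL_MAX_SIZE)

@asynccontextmanager
async def get_db_connection(pool: asyncpg.Pool):
    # Wait up to 30 seconds (instead of 5) for a free connection so that short
    # bursts of concurrent requests do not immediately surface as errors.
    conn = await pool.acquire(timeout=30)
    try:
        yield conn
    finally:
        await pool.release(conn)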

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant set of improvements focused on performance, scalability, and robustness, particularly within the data ingestion and job management systems. Key enhancements include optimizing database connection pooling, making password hashing asynchronous to prevent event loop blocking, and implementing a sophisticated job recovery mechanism for stuck jobs. The addition of a load testing script is a commendable step towards ensuring system stability. The new design document is also a valuable contribution. My feedback primarily focuses on minor code refinements to improve maintainability and memory efficiency.

![](img/poc-stats-brainkb.png)
![](img/bbqs-projects.png)

By bringing together concepts discussed across channels, PoC artifacts, and related efforts, this document aims to align define scope, tasks and connect parallel work streams.

medium

There appears to be a minor grammatical issue in this sentence. It reads ...aims to align define scope..., which seems to be missing a conjunction or comma. To improve clarity, I suggest rephrasing it.

Suggested change
By bringing together concepts discussed across channels, PoC artifacts, and related efforts, this document aims to align define scope, tasks and connect parallel work streams.
By bringing together concepts discussed across channels, PoC artifacts, and related efforts, this document aims to align, define scope, and connect parallel work streams.


7. **Administrative Data Ingestion and Management**
The system **must provide an administrative interface** that allows authorized users to configure and trigger data ingestion from external sources. This interface should support updating, refreshing, and managing source data so that the knowledge graph can be **automatically constructed and updated** based on incoming data.


medium

This 'Important' note about model-driven UI feels disconnected from the surrounding text, placed between two different requirement sections. To improve the document's structure and flow, consider integrating this point as a formal requirement within the 'User Interface Requirements' section or as a key consideration under the 'User Interface Design and Development' task list.

Comment on lines +364 to +376
    if conn is not None:
        # Use provided connection
        results = await conn.fetch(query, user_id)
        assigned_scopes_to_user = [result["name"] for result in results]
        logger.debug(f"Scopes for user {user_id}: {assigned_scopes_to_user}")
        return assigned_scopes_to_user
    else:
        # Manage our own connection
        async with get_db_connection() as connection:
            results = await connection.fetch(query, user_id)
            assigned_scopes_to_user = [result["name"] for result in results]
            logger.debug(f"Scopes for user {user_id}: {assigned_scopes_to_user}")
            return assigned_scopes_to_user

medium

The logic to fetch scopes and create the list of scope names is duplicated in both the if conn is not None: and the else: block. To adhere to the DRY (Don't Repeat Yourself) principle and improve maintainability, this logic can be extracted into a local helper function.

Suggested change
    if conn is not None:
        # Use provided connection
        results = await conn.fetch(query, user_id)
        assigned_scopes_to_user = [result["name"] for result in results]
        logger.debug(f"Scopes for user {user_id}: {assigned_scopes_to_user}")
        return assigned_scopes_to_user
    else:
        # Manage our own connection
        async with get_db_connection() as connection:
            results = await connection.fetch(query, user_id)
            assigned_scopes_to_user = [result["name"] for result in results]
            logger.debug(f"Scopes for user {user_id}: {assigned_scopes_to_user}")
            return assigned_scopes_to_user

    async def _fetch_scopes(db_conn):
        results = await db_conn.fetch(query, user_id)
        assigned_scopes_to_user = [result["name"] for result in results]
        logger.debug(f"Scopes for user {user_id}: {assigned_scopes_to_user}")
        return assigned_scopes_to_user

    if conn is not None:
        # Use provided connection
        return await _fetch_scopes(conn)
    else:
        # Manage our own connection
        async with get_db_connection() as connection:
            return await _fetch_scopes(connection)

Comment on lines +30 to +36
DB_SETTINGS = {
    "user": load_environment()["JWT_POSTGRES_DATABASE_USER"],
    "password": load_environment()["JWT_POSTGRES_DATABASE_PASSWORD"],
    "database": load_environment()["JWT_POSTGRES_DATABASE_NAME"],
    "host": load_environment()["JWT_POSTGRES_DATABASE_HOST_URL"],
    "port": load_environment()["JWT_POSTGRES_DATABASE_PORT"],
}

medium

The previous get_db_settings function provided default values for database connection settings (e.g., host = 'postgres'), which made local setup more robust if the .env file was incomplete. The new direct assignment to DB_SETTINGS will cause the application to fail if any of these environment variables are missing. While explicit configuration is good, this could be a regression in developer experience for local setups. Consider reintroducing default values using .get('VAR', 'default').

Suggested change
DB_SETTINGS = {
    "user": load_environment()["JWT_POSTGRES_DATABASE_USER"],
    "password": load_environment()["JWT_POSTGRES_DATABASE_PASSWORD"],
    "database": load_environment()["JWT_POSTGRES_DATABASE_NAME"],
    "host": load_environment()["JWT_POSTGRES_DATABASE_HOST_URL"],
    "port": load_environment()["JWT_POSTGRES_DATABASE_PORT"],
}

env = load_environment()
DB_SETTINGS = {
    "user": env.get("JWT_POSTGRES_DATABASE_USER", "postgres"),
    "password": env.get("JWT_POSTGRES_DATABASE_PASSWORD", ""),
    "database": env.get("JWT_POSTGRES_DATABASE_NAME", "brainkb"),
    "host": env.get("JWT_POSTGRES_DATABASE_HOST_URL", "postgres"),
    "port": int(env.get("JWT_POSTGRES_DATABASE_PORT", 5432)),
}

Comment on lines +365 to +377
    if conn is not None:
        # Use provided connection
        results = await conn.fetch(query, user_id)
        assigned_scopes_to_user = [result["name"] for result in results]
        logger.debug(f"Scopes for user {user_id}: {assigned_scopes_to_user}")
        return assigned_scopes_to_user
    else:
        # Manage our own connection
        async with get_db_connection() as connection:
            results = await connection.fetch(query, user_id)
            assigned_scopes_to_user = [result["name"] for result in results]
            logger.debug(f"Scopes for user {user_id}: {assigned_scopes_to_user}")
            return assigned_scopes_to_user

medium

The logic to fetch scopes and create the list of scope names is duplicated in both the if conn is not None: and the else: block. To adhere to the DRY (Don't Repeat Yourself) principle and improve maintainability, this logic can be extracted into a local helper function.

Suggested change
    if conn is not None:
        # Use provided connection
        results = await conn.fetch(query, user_id)
        assigned_scopes_to_user = [result["name"] for result in results]
        logger.debug(f"Scopes for user {user_id}: {assigned_scopes_to_user}")
        return assigned_scopes_to_user
    else:
        # Manage our own connection
        async with get_db_connection() as connection:
            results = await connection.fetch(query, user_id)
            assigned_scopes_to_user = [result["name"] for result in results]
            logger.debug(f"Scopes for user {user_id}: {assigned_scopes_to_user}")
            return assigned_scopes_to_user

    async def _fetch_scopes(db_conn):
        results = await db_conn.fetch(query, user_id)
        assigned_scopes_to_user = [result["name"] for result in results]
        logger.debug(f"Scopes for user {user_id}: {assigned_scopes_to_user}")
        return assigned_scopes_to_user

    if conn is not None:
        # Use provided connection
        return await _fetch_scopes(conn)
    else:
        # Manage our own connection
        async with get_db_connection() as connection:
            return await _fetch_scopes(connection)

Comment on lines +175 to +237
async def _process_large_file_with_lightweight_provenance(
    filepath: str,
    user_id: str,
    ext: str,
) -> Tuple[str, bool]:
    """
    OPTIMIZATION: For large files, append lightweight provenance without full graph parsing.
    This avoids loading entire 40MB+ files into memory for parsing.
    """
    from rdflib import Graph, URIRef, Literal, RDF, XSD, DCTERMS, PROV
    from rdflib import Namespace
    import datetime
    import uuid

    filename = os.path.basename(filepath)
    processed_filepath = filepath + ".processed.ttl"

    try:
        # Generate lightweight provenance (minimal RDF, no full graph parsing)
        start_time = datetime.datetime.utcnow().isoformat() + "Z"
        provenance_uuid = str(uuid.uuid4())
        BASE = Namespace("https://identifiers.org/brain-bican/vocab/")

        prov_entity = URIRef(BASE[f"provenance/{provenance_uuid}"])
        ingestion_activity = URIRef(BASE[f"ingestionActivity/{provenance_uuid}"])
        user_uri = URIRef(BASE[f"agent/{user_id}"])

        # Create minimal provenance graph
        prov_graph = Graph()
        prov_graph.add((prov_entity, RDF.type, PROV.Entity))
        prov_graph.add((prov_entity, PROV.generatedAtTime, Literal(start_time, datatype=XSD.dateTime)))
        prov_graph.add((prov_entity, PROV.wasAttributedTo, user_uri))
        prov_graph.add((prov_entity, PROV.wasGeneratedBy, ingestion_activity))
        prov_graph.add((ingestion_activity, RDF.type, PROV.Activity))
        prov_graph.add((ingestion_activity, RDF.type, BASE["IngestionActivity"]))
        prov_graph.add((ingestion_activity, PROV.generatedAtTime, Literal(start_time, datatype=XSD.dateTime)))
        prov_graph.add((ingestion_activity, PROV.wasAssociatedWith, user_uri))
        prov_graph.add((prov_entity, DCTERMS.provenance, Literal(f"Data ingested by {user_id} on {start_time}")))

        # Serialize provenance to Turtle
        provenance_ttl = prov_graph.serialize(format="turtle")

        # Stream original file and append provenance (memory efficient)
        with open(processed_filepath, "w", encoding="utf-8") as out:
            # Copy original file content
            with open(filepath, "r", encoding="utf-8", errors="ignore") as inf:
                # For very large files, copy in chunks
                while True:
                    chunk = inf.read(10 * 1024 * 1024)  # 10 MB chunks
                    if not chunk:
                        break
                    out.write(chunk)

            # Append provenance at the end
            out.write("\n\n# Provenance metadata\n")
            out.write(provenance_ttl)

        return processed_filepath, True

    except Exception as e:
        logger.warning(f"Failed lightweight provenance for {filename}: {e}. Using original file.", exc_info=True)
        return filepath, False


medium

The function _process_large_file_with_lightweight_provenance is defined here, but it doesn't appear to be called anywhere in the codebase included in this pull request. If it's intended for future use, it might be fine, but as it stands, it's dead code. Please either integrate it or remove it to keep the codebase clean.

Comment on lines 337 to +355
        try:
            with open(processed_filepath, "r", encoding="utf-8") as f:
                file_data = f.read()
            payload = file_data.encode("utf-8")
            content_type = "text/turtle"
            content_type = "text/turtle" if processed_filepath.endswith(".ttl") else get_content_type_for_ext(ext)

            # Read file in chunks and combine (more memory efficient than reading all at once)
            chunks = []
            with open(processed_filepath, "rb") as f:
                while True:
                    chunk = f.read(16 * 1024 * 1024)  # 16 MB chunks
                    if not chunk:
                        break
                    chunks.append(chunk)
            payload = b"".join(chunks)

            resp = await client.post(
                url,
                content=payload,
                headers={"Content-Type": content_type},
                auth=auth,
            )

medium

This optimization for large files is a good step, but it still loads the entire file content into memory with payload = b"".join(chunks) before making the HTTP request. For extremely large files, this can still cause memory issues. To achieve true streaming and minimize memory usage, you can pass an async generator to httpx.AsyncClient.post. This would stream the file from disk directly to the request body.

            content_type = "text/turtle" if processed_filepath.endswith(".ttl") else get_content_type_for_ext(ext)
            
            async def file_streamer():
                with open(processed_filepath, "rb") as f:
                    while True:
                        chunk = f.read(16 * 1024 * 1024)  # 16 MB chunks
                        if not chunk:
                            break
                        yield chunk
            
            resp = await client.post(
                url,
                content=file_streamer(),
                headers={"Content-Type": content_type},
                auth=auth,
            )

Comment on lines +1422 to +1426
    if job.get("current_stage"):
        resp["stage_description"] = stage_descriptions.get(
            job["current_stage"],
            f"Current stage: {job['current_stage']}"
        )

medium

This block of code, which sets the stage_description, is a duplicate of the elif block on lines 1416-1420. This redundant code should be removed to improve clarity and maintainability.

Comment on lines +732 to 818
# def attach_provenance(user: str, ttl_data: str) -> str:
# """
# Attach the provenance information about the ingestion activity. Saying, we received this triple by X user on XXXX date.
# It appends provenance triples externally while keeping the original triples intact.
#
# Parameters:
# - user (str): The username of the person posting the data.
# - ttl_data (str): The existing Turtle (TTL) RDF data.
#
# Returns:
# - str: Combined RDF (Turtle format) containing original data and provenance metadata.
# """
# # Validate input parameters
# if not isinstance(user, str) or not user.strip():
# raise ValueError("User must be a non-empty string.")
# if not isinstance(ttl_data, str) or not ttl_data.strip():
# raise ValueError("TTL data must be a non-empty string.")
#
# try:
# original_graph = Graph()
# original_graph.parse(data=ttl_data, format="turtle")
# except Exception as e:
# raise RuntimeError(f"Error parsing TTL data: {e}")
#
# try:
# BASE = extract_base_namespace(original_graph)
# except Exception as e:
# raise RuntimeError(f"Failed to extract base namespace: {e}")
#
# try:
# # Create provenance graph
# prov_graph = Graph()
#
# # Generate timestamps (ISO 8601 format, UTC)
# start_time = datetime.datetime.utcnow().isoformat() + "Z"
#
# # Generate a unique UUID for provenance entity
# provenance_uuid = str(uuid.uuid4())
# prov_entity = URIRef(BASE[f"provenance/{provenance_uuid}"])
# ingestion_activity = URIRef(BASE[f"ingestionActivity/{provenance_uuid}"])
# user_uri = URIRef(BASE[f"agent/{user}"])
#
# # Define provenance entity
# prov_graph.add((prov_entity, RDF.type, PROV.Entity))
# prov_graph.add((prov_entity, PROV.generatedAtTime, Literal(start_time, datatype=XSD.dateTime)))
# prov_graph.add((prov_entity, PROV.wasAttributedTo, user_uri))
# prov_graph.add((prov_entity, PROV.wasGeneratedBy, ingestion_activity))
#
# # Define ingestion activity
# # here we say IngestionActivity is an activity of type prov:Activity
# prov_graph.add((ingestion_activity, RDF.type, PROV.Activity))
# prov_graph.add((ingestion_activity, RDF.type, BASE["IngestionActivity"]))
# prov_graph.add((ingestion_activity, PROV.generatedAtTime, Literal(start_time, datatype=XSD.dateTime)))
# prov_graph.add((ingestion_activity, PROV.wasAssociatedWith, user_uri))
#
# # Attach provenance to original triples
# # OPTIMIZATION: Use set to avoid duplicate checks and limit entities for performance
# # Adaptive limit based on graph size to balance performance vs completeness
# graph_size = len(original_graph)
# if graph_size > 100000: # Very large graphs (>100k triples)
# max_entities = 500 # Limit more aggressively
# elif graph_size > 50000: # Large graphs (50k-100k triples)
# max_entities = 750
# else: # Medium/small graphs (<50k triples)
# max_entities = 1000 # Can process more entities
#
# entity_count = 0
# seen_entities = set()
#
# for entity in original_graph.subjects():
# if entity_count >= max_entities:
# break
# if isinstance(entity, URIRef) and entity not in seen_entities:
# seen_entities.add(entity)
# prov_graph.add((ingestion_activity, PROV.wasAssociatedWith, entity))
# entity_count += 1
#
# # add a Dublin Core provenance statement -- this is the new addition to say it's ingested by user
# prov_graph.add((prov_entity, DCTERMS.provenance, Literal(f"Data ingested by {user} on {start_time}")))
#
# # Combine both graphs (original + provenance) so that we have new provenance information attached.
# final_graph = original_graph + prov_graph
#
# return final_graph.serialize(format="turtle")
# except Exception as e:
# raise RuntimeError(f"Error generating provenance RDF: {e}")


medium

The old implementation of attach_provenance has been commented out. Since the new implementation is much cleaner and more efficient, this large block of dead code should be removed to improve the readability and maintainability of the file.

@tekrajchhetri
Collaborator Author

tekrajchhetri commented Feb 5, 2026

You may notice additional changes in this PR. These are due to other PRs that have not yet been merged. For this review, please focus only on the design document https://github.com/sensein/BrainKB/blob/design_docs/docs/design_docs/PI-Grant-Skills-Research-Design.md.

@djarecka
Contributor

djarecka commented Feb 6, 2026

@tekrajchhetri - could you please open a new PR with the design doc only?

@tekrajchhetri
Collaborator Author

tekrajchhetri commented Feb 6, 2026

@djarecka It will be the same unless the other PR is merged, since this branch was created from another working branch rather than main. Let me create a new branch from main and do that.

@tekrajchhetri tekrajchhetri changed the title Design docs for "Construction of Enriched Research Knowledge Graphs and an Interactive User Interface for Project Grants and Research Findings" Closed - Design docs for "Construction of Enriched Research Knowledge Graphs and an Interactive User Interface for Project Grants and Research Findings" Feb 6, 2026
@djarecka
Contributor

djarecka commented Feb 6, 2026

let's try to create new branches from main unless there is really a reason not to.

also, this PR contained more than the design doc and PR #59, so perhaps you have more changes that you want to commit

