### LightRAG Markdown Ingestion

#### Reset LightRAG (Optional)

In [1]:
from pathlib import Path
import shutil

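# Delete any existing LightRAG workspace so ingestion starts from an empty store.
LIGHTRAG_DIR = Path(".rag/lightrag")
if LIGHTRAG_DIR.exists():
    shutil.rmtree(LIGHTRAG_DIR)
    print(f"Removed existing LightRAG workspace: {LIGHTRAG_DIR}")
# Recreate the (now empty) workspace directory.
LIGHTRAG_DIR.mkdir(parents=True, exist_ok=True)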

Removed existing LightRAG workspace: .rag/lightrag
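
The reset step above can also be wrapped in a small function so that it can be tried on a throwaway directory before it is pointed at the real workspace. This is a minimal sketch: `reset_workspace` is a hypothetical helper name, not part of LightRAG or the original notebook, and the demo deliberately uses a temporary directory.

```python
from pathlib import Path
import shutil
import tempfile


def reset_workspace(workspace: Path) -> bool:
    """Delete `workspace` if it exists, then recreate it empty.

    Returns True if an existing directory was removed.
    (Hypothetical helper; mirrors the reset cell above.)
    """
    existed = workspace.exists()
    if existed:
        shutil.rmtree(workspace)
    workspace.mkdir(parents=True, exist_ok=True)
    return existed


# Demonstrate on a temporary directory rather than the real workspace.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / ".rag" / "lightrag"
    first = reset_workspace(demo)            # nothing to remove yet
    (demo / "stale.json").write_text("{}")   # simulate leftover state
    second = reset_workspace(demo)           # removes the stale file
    remaining = list(demo.iterdir())
```

Returning a flag instead of printing lets the caller decide whether to log the removal, as the cell above does with its `print` call.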


#### LightRAG Ingestion

In [2]:
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

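# The repository root is taken to be the parent of the current working directory.
repo_root = Path.cwd().resolve().parent
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

# Load environment variables from the repository's .env, overriding any already set.
load_dotenv(repo_root / ".env", override=True)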

from KnowledgeGraph.LightRAG.lightrag_runner import (
    ChunkingConfig,
    FallbackConfig,
    build_lightrag,
    ensure_initialized,
    ingest_markdown_files,
    DEFAULT_SOURCE_DIR,
)

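# Prefer LIGHTRAG_SOURCE_DIR from the environment; otherwise use the runner's default.
source_dir_env = os.getenv("LIGHTRAG_SOURCE_DIR")
source_dir = Path(source_dir_env) if source_dir_env else DEFAULT_SOURCE_DIR
# Relative paths are resolved against the repository root.
if not source_dir.is_absolute():
    source_dir = (repo_root / source_dir).resolve()

MAX_WORKERS = 40  # integer passed as max_workers; overrides the LightRAG default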
chunk_cfg = ChunkingConfig.from_env()
fallback_cfg = FallbackConfig.from_env()

rag = build_lightrag()
await ensure_initialized(rag)

ingested = await ingest_markdown_files(
    rag,
    source_dir=source_dir,
    max_workers=MAX_WORKERS,
    show_progress=True,
    show_chunk_progress=True,
    chunking=chunk_cfg,
    fallback=fallback_cfg,
)

print(f"Ingested {len(ingested)} markdown files from {source_dir}.")
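
The source-directory resolution in the cell above (environment variable first, then a default, with relative paths anchored at the repository root) can be isolated into a pure function, which makes it easy to check without touching the real environment. This is a minimal sketch: `resolve_source_dir` is a hypothetical name, and the `docs/markdown` default is a placeholder, since the actual value of `DEFAULT_SOURCE_DIR` is not shown here.

```python
from pathlib import Path
from typing import Mapping
import tempfile


def resolve_source_dir(env: Mapping[str, str], default: Path, repo_root: Path) -> Path:
    """Pick the source directory using the same rules as the cell above.

    (Hypothetical helper; `env` stands in for os.environ.)
    """
    raw = env.get("LIGHTRAG_SOURCE_DIR")
    # An unset or empty variable falls back to the default.
    source_dir = Path(raw) if raw else default
    # Relative paths are resolved against the repository root.
    if not source_dir.is_absolute():
        source_dir = (repo_root / source_dir).resolve()
    return source_dir


# Placeholder values for illustration only.
root = Path(tempfile.gettempdir()).resolve()
placeholder_default = Path("docs/markdown")

from_default = resolve_source_dir({}, placeholder_default, root)
from_empty = resolve_source_dir({"LIGHTRAG_SOURCE_DIR": ""}, placeholder_default, root)
from_env = resolve_source_dir(
    {"LIGHTRAG_SOURCE_DIR": str(root / "elsewhere")}, placeholder_default, root
)
```

In the notebook itself, `os.environ` would be passed as `env` and `DEFAULT_SOURCE_DIR` as `default`.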

INFO: [_] Created new empty graph file: .rag/lightrag/graph_chunk_entity_relation.graphml
INFO:nano-vectordb:Init {'embedding_dim': 3072, 'metric': 'cosine', 'storage_file': '.rag/lightrag/vdb_entities.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 3072, 'metric': 'cosine', 'storage_file': '.rag/lightrag/vdb_relationships.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 3072, 'metric': 'cosine', 'storage_file': '.rag/lightrag/vdb_chunks.json'} 0 data
INFO: Reset 1 documents from PROCESSING/FAILED to PENDING status
INFO: Processing 1 document(s)
INFO: Extracting stage 1/1: 2025-04-27_en.md
INFO: Processing d-id: doc-3e38e39869541f028371688baefa7699
INFO: Embedding func: 8 new workers initialized (Timeouts: Func: 30s, Worker: 60s, Health Check: 75s)
INFO: Chunk 1 of 2236 extracted 44 Ent + 42 Rel chunk-4eed60346530c72dea17daf43f3318fb
INFO: Chunk 2 of 2236 extracted 0 Ent + 0 Rel chunk-9efc314b65237d5d646e1b817372afc6
INFO: Chunk 3 of 2236 extracted 16 Ent + 18 Rel chunk-

CancelledError: 

INFO:  == LLM cache == saving: default:extract:773b22c02cca8b6abb393925ee326b20
INFO:  == LLM cache == saving: default:extract:2339c444c0ba476ec7c463d65dfc204d
INFO:  == LLM cache == saving: default:extract:661c6c5405622e5abe8ae68c2773bc3f
INFO:  == LLM cache == saving: default:extract:5fe1a8e79741caf31d29a8da1d334f20
INFO:  == LLM cache == saving: default:extract:49b05be77fa3c07291d42a2f6c01f27d
INFO:  == LLM cache == saving: default:extract:1e1fa77ff4b411d9fa41366a58ab3dd9
INFO:  == LLM cache == saving: default:extract:7ca6d62f40746a543d3e9d2315d2bb02
INFO:  == LLM cache == saving: default:extract:48fd1d3b9aa48c031f6da0a15c459a58
INFO:  == LLM cache == saving: default:extract:c971bffc556ab08c7bf1d5c39b34195c
INFO: Chunk 137 of 2236 extracted 21 Ent + 19 Rel chunk-55180b13e8b026de267158d77de89b61
INFO:  == LLM cache == saving: default:extract:56b31872a2bf46b1bc8391700ba2ff46
INFO: Chunk 138 of 2236 extracted 5 Ent + 5 Rel chunk-5ba83b28eabb1bee14c449ddea86e330
INFO:  == LLM cache == s