feat: add Bash/Shell (.sh) parsing support (#197)#230
feat: add Bash/Shell (.sh) parsing support (#197)#230azizur100389 wants to merge 1 commit intotirth8205:mainfrom
Conversation
Register .sh / .bash / .zsh / .ksh with tree-sitter-bash. Extract File and Function nodes for shell function definitions, CALLS edges for command invocations (both to local functions and external binaries), and IMPORTS_FROM edges for `source` / `.` dot-includes. Test functions (test_* prefix) are classified as Test kind. Root cause of tirth8205#197 ------------------ Shell-script-heavy repos (installers, infra, CI tooling) previously indexed to 0 nodes / 0 edges because .sh was not in EXTENSION_TO_LANGUAGE. build_or_update_graph reported `parsed 0 files` for FragHub-style repos, which blocked architecture mapping and flow detection for the category of projects that most need them. What's supported ---------------- * Function definitions in both tree-sitter-bash shapes: foo() { ... } function foo { ... } function foo() { ... } * Call extraction for `command` nodes (tree-sitter-bash wraps every invocation in a `command` node with a `command_name` > `word` child). Local calls resolve to qualified names; external binaries (curl, grep, awk) are recorded with their raw name, consistent with how unresolved calls work in other languages. * `source lib.sh` and `. lib.sh` emit IMPORTS_FROM edges (both bareword and quoted-string arguments handled). * `test_*` prefix -> Test kind via the existing _TEST_PATTERNS list. Implementation -------------- 1. EXTENSION_TO_LANGUAGE: +4 shell extensions -> "bash" 2. _CLASS_TYPES["bash"] = [] (no class concept in bash) 3. _FUNCTION_TYPES["bash"] = ["function_definition"] 4. _IMPORT_TYPES["bash"] = [] (source handled via constructs handler) 5. _CALL_TYPES["bash"] = ["command"] 6. New _extract_bash_constructs() + _bash_get_source_target() helpers following the _extract_lua_constructs pattern. Dispatched next to the existing Lua/Luau handler. 7. _get_name: new "bash" branch that reads the first `word` child of a function_definition (handles both paren and function-keyword forms). 8. _get_call_name: new "bash" branch that reads command_name > word. Scope caveats documented in PR body ----------------------------------- * No cross-file call resolution beyond direct source includes (bash has no formal module system). * Calls whose name is a variable expansion ($foo, `$(cmd)`) are not statically resolvable and are skipped by _get_call_name. * Pipelines, subshells, heredocs are parsed by tree-sitter but not semantically linked yet. Tests added (tests/test_multilang.py::TestBashParsing — 10 tests) ----------------------------------------------------------------- - test_detects_language (.sh, .bash, .zsh) - test_finds_function_definitions_paren_form - test_finds_function_definitions_function_keyword_form - test_finds_source_imports (source + . + quoted string) - test_finds_calls_between_local_functions - test_finds_external_command_calls - test_finds_contains_edges - test_detects_test_functions - test_nodes_have_bash_language - test_calls_inside_functions New fixture: tests/fixtures/sample.sh exercises both function-definition forms, both import forms (bareword + quoted), local + external call mix, and test_* naming. Test results ------------ Stage 1 (new targeted tests): 10/10 passed. Stage 2 (tests/test_multilang.py full): 140/140 passed — no regressions across any language. Stage 3 (tests/test_parser.py adjacent): 67/67 passed. Stage 4 (full suite): 705 passed (up from 699 baseline — +6 new), 6 pre-existing Windows failures in test_incremental/test_notebook (verified identical on unchanged main in PR tirth8205#226 work). Stage 5 (ruff check): clean. Stage 6 (fixture smoke parse): 8 nodes, 26 edges. All expected function, call, import, and test nodes present. Zero regressions. All new code lives behind the bash language check so existing languages are untouched.
|
Thank you for this @CodeBlackwell — closing as superseded by PR #227 which already shipped Bash/Shell support in v2.3.0 (now on PyPI, along with v2.3.1 that adds the Windows MCP hang fix). Your PR was in flight at the same time I shipped #227, so this isn't redundant work on your end — the two implementations arrived essentially in parallel. Both approaches are very similar: If there's anything your implementation does that mine doesn't (e.g. Really appreciate the thorough test breakdown (10/10 targeted, 140/140 multilang, 705 full suite). That's exactly how I'd want a PR tested. |
Register .ksh (Korn shell) with tree-sitter-bash alongside the existing .sh / .bash / .zsh entries added in tirth8205#227. Korn shell is close enough to bash syntactically that tree-sitter-bash handles the structural features the graph captures (function definitions, commands, source/. includes) correctly. Context ------- In the close comment on PR tirth8205#230, @tirth8205 explicitly flagged .ksh as a missing extension: "The .ksh extension in particular looks worth adding — I didn't include it in tirth8205#227." This PR addresses exactly that gap. Issue tirth8205#235 tracks the request. Why it matters -------------- Korn shell is still used in legacy AIX/Solaris operations, IBM internal tooling, and enterprise CI scripts. Repositories that ship .ksh scripts currently index to 0 nodes because the extension is unrecognized — the same failure mode that motivated tirth8205#197. Implementation -------------- One line added to EXTENSION_TO_LANGUAGE in parser.py: ".ksh": "bash" All of the bash parsing machinery shipped in tirth8205#227 (_FUNCTION_TYPES, _CALL_TYPES, _extract_bash_source_command, name/call resolution) already supports any file parsed through the "bash" language path, so no further changes are needed. Tests added (tests/test_multilang.py::TestBashParsing) ------------------------------------------------------ 1. test_detects_language — extended with a .ksh assertion to lock in the extension mapping (regression guard for tirth8205#235). 2. test_ksh_extension_parses_as_bash — end-to-end regression test that copies the existing tests/fixtures/sample.sh to a temp .ksh file, parses it through the real CodeParser, and asserts: - every node's language field is "bash" - the set of extracted Function names is identical to the .sh run - the CONTAINS / CALLS / IMPORTS_FROM edge counts per kind match The second assertion proves the .ksh path is fully wired through to the same structural extraction as .sh, not a degenerate zero-result read. Test results ------------ Stage 1 (new targeted tests): 2/2 passed. Stage 2 (tests/test_multilang.py full): 152/152 passed — zero regressions across any language. Stage 3 (tests/test_parser.py adjacent): 67/67 passed. Stage 4 (full suite): 733 passed. 8 pre-existing Windows failures in test_incremental (3) + test_main async coroutine detection (1) + test_notebook Databricks (4) — verified identical on unchanged main. Stage 5 (ruff check on parser.py and test_multilang.py): clean. Stage 6 (end-to-end smoke): detect_language("legacy.ksh") -> "bash"; parsing a real .ksh file produces 6 Function nodes, 18 edges, all tagged language=bash. Zero regressions. Single-line extension mapping change plus a targeted regression guard against the specific issue the maintainer flagged.
Summary
Register
.sh/.bash/.zsh/.kshwith tree-sitter-bash. ExtractFileandFunctionnodes for shell function definitions,CALLSedges for command invocations (local functions and external binaries), andIMPORTS_FROMedges forsource/.dot-includes.test_*prefix is classified asTestkind.Closes #197.
Root cause
Shell-script-heavy repos (installers, infra, CI tooling) previously indexed to 0 nodes / 0 edges because
.shwas not inEXTENSION_TO_LANGUAGE.build_or_update_graphreportedparsed 0 filesfor FragHub-style repos, which blocked architecture mapping and flow detection for the category of projects that most need them.What's supported
commandnode with acommand_name > wordchild. Local calls resolve to qualified names (file.sh::greet); external binaries (curl,grep,awk) are recorded with their raw name, consistent with how unresolved calls work in other languages.source lib.shand. lib.shemitIMPORTS_FROMedges. Both bareword (source lib/utils.sh) and quoted-string (source "lib/helpers.sh") arguments are handled.test_*prefix classified asTestkind via the existing_TEST_PATTERNSlist.Implementation
EXTENSION_TO_LANGUAGE— added.sh,.bash,.zsh,.ksh→"bash"_CLASS_TYPES["bash"] = [](no class concept in bash)_FUNCTION_TYPES["bash"] = ["function_definition"]_IMPORT_TYPES["bash"] = [](source handled via constructs handler)_CALL_TYPES["bash"] = ["command"]_extract_bash_constructs()+_bash_get_source_target()helpers following the_extract_lua_constructspattern, dispatched next to the existing Lua/Luau handler._get_name— new"bash"branch that reads the firstwordchild of afunction_definition(handles both paren and function-keyword forms)._get_call_name— new"bash"branch that readscommand_name > word, skipping commands whose name is a variable expansion (can't be resolved statically).Scope caveats
sourceincludes (bash has no formal module system).$foo,$(cmd)) are not statically resolvable and are skipped by_get_call_name.Tests added (
tests/test_multilang.py::TestBashParsing— 10 tests)test_detects_language(.sh, .bash, .zsh)test_finds_function_definitions_paren_formtest_finds_function_definitions_function_keyword_formtest_finds_source_imports(source+.+ quoted string)test_finds_calls_between_local_functionstest_finds_external_command_callstest_finds_contains_edgestest_detects_test_functionstest_nodes_have_bash_languagetest_calls_inside_functionsNew fixture:
tests/fixtures/sample.shexercises both function-definition forms, both import forms, local + external call mix, andtest_*naming.Test results
tests/test_multilang.pyfulltests/test_parser.pytest_incremental/test_notebook(verified identical on unchangedmain)ruff checkon changed filesZero regressions. All new code is gated on the bash language check so existing languages are untouched.