Skip to content

feat: add Bash/Shell (.sh) parsing support (#197)#230

Closed
azizur100389 wants to merge 1 commit intotirth8205:mainfrom
azizur100389:feat/bash-parsing
Closed

feat: add Bash/Shell (.sh) parsing support (#197)#230
azizur100389 wants to merge 1 commit intotirth8205:mainfrom
azizur100389:feat/bash-parsing

Conversation

@azizur100389
Copy link
Copy Markdown
Contributor

Summary

Register .sh / .bash / .zsh / .ksh with tree-sitter-bash. Extract File and Function nodes for shell function definitions, CALLS edges for command invocations (local functions and external binaries), and IMPORTS_FROM edges for source / . dot-includes. test_* prefix is classified as Test kind.

Closes #197.

Root cause

Shell-script-heavy repos (installers, infra, CI tooling) previously indexed to 0 nodes / 0 edges because .sh was not in EXTENSION_TO_LANGUAGE. build_or_update_graph reported parsed 0 files for FragHub-style repos, which blocked architecture mapping and flow detection for the category of projects that most need them.

What's supported

  • Function definitions in both tree-sitter-bash shapes
    foo() { ... }
    function foo { ... }
    function foo() { ... }
  • Call extraction — tree-sitter-bash wraps every invocation in a command node with a command_name > word child. Local calls resolve to qualified names (file.sh::greet); external binaries (curl, grep, awk) are recorded with their raw name, consistent with how unresolved calls work in other languages.
  • Dot-includes as importssource lib.sh and . lib.sh emit IMPORTS_FROM edges. Both bareword (source lib/utils.sh) and quoted-string (source "lib/helpers.sh") arguments are handled.
  • Test detectiontest_* prefix classified as Test kind via the existing _TEST_PATTERNS list.

Implementation

  1. EXTENSION_TO_LANGUAGE — added .sh, .bash, .zsh, .ksh"bash"
  2. _CLASS_TYPES["bash"] = [] (no class concept in bash)
  3. _FUNCTION_TYPES["bash"] = ["function_definition"]
  4. _IMPORT_TYPES["bash"] = [] (source handled via constructs handler)
  5. _CALL_TYPES["bash"] = ["command"]
  6. New _extract_bash_constructs() + _bash_get_source_target() helpers following the _extract_lua_constructs pattern, dispatched next to the existing Lua/Luau handler.
  7. _get_name — new "bash" branch that reads the first word child of a function_definition (handles both paren and function-keyword forms).
  8. _get_call_name — new "bash" branch that reads command_name > word, skipping commands whose name is a variable expansion (can't be resolved statically).

Scope caveats

  • No cross-file call resolution beyond direct source includes (bash has no formal module system).
  • Calls whose name is a variable expansion ($foo, $(cmd)) are not statically resolvable and are skipped by _get_call_name.
  • Pipelines, subshells, heredocs are parsed by tree-sitter but not semantically linked yet.

Tests added (tests/test_multilang.py::TestBashParsing — 10 tests)

  • test_detects_language (.sh, .bash, .zsh)
  • test_finds_function_definitions_paren_form
  • test_finds_function_definitions_function_keyword_form
  • test_finds_source_imports (source + . + quoted string)
  • test_finds_calls_between_local_functions
  • test_finds_external_command_calls
  • test_finds_contains_edges
  • test_detects_test_functions
  • test_nodes_have_bash_language
  • test_calls_inside_functions

New fixture: tests/fixtures/sample.sh exercises both function-definition forms, both import forms, local + external call mix, and test_* naming.

Test results

Stage Result
Stage 1 — new targeted tests 10/10 passed
Stage 2 — tests/test_multilang.py full 140/140 passed — zero regressions across any language
Stage 3 — adjacent tests/test_parser.py 67/67 passed
Stage 4 — full suite 705 passed (up from 699 baseline — +6 new), 6 pre-existing Windows failures in test_incremental / test_notebook (verified identical on unchanged main)
Stage 5 — ruff check on changed files clean
Stage 6 — fixture smoke parse 8 nodes, 26 edges — all expected function, call, import, and test nodes present

Zero regressions. All new code is gated on the bash language check so existing languages are untouched.

Register .sh / .bash / .zsh / .ksh with tree-sitter-bash. Extract File
and Function nodes for shell function definitions, CALLS edges for
command invocations (both to local functions and external binaries),
and IMPORTS_FROM edges for `source` / `.` dot-includes. Test functions
(test_* prefix) are classified as Test kind.

Root cause of tirth8205#197
------------------
Shell-script-heavy repos (installers, infra, CI tooling) previously
indexed to 0 nodes / 0 edges because .sh was not in EXTENSION_TO_LANGUAGE.
build_or_update_graph reported `parsed 0 files` for FragHub-style repos,
which blocked architecture mapping and flow detection for the category
of projects that most need them.

What's supported
----------------
* Function definitions in both tree-sitter-bash shapes:
    foo() { ... }
    function foo { ... }
    function foo() { ... }
* Call extraction for `command` nodes (tree-sitter-bash wraps every
  invocation in a `command` node with a `command_name` > `word` child).
  Local calls resolve to qualified names; external binaries (curl, grep,
  awk) are recorded with their raw name, consistent with how unresolved
  calls work in other languages.
* `source lib.sh` and `. lib.sh` emit IMPORTS_FROM edges (both bareword
  and quoted-string arguments handled).
* `test_*` prefix -> Test kind via the existing _TEST_PATTERNS list.

Implementation
--------------
1. EXTENSION_TO_LANGUAGE: +4 shell extensions -> "bash"
2. _CLASS_TYPES["bash"] = []  (no class concept in bash)
3. _FUNCTION_TYPES["bash"] = ["function_definition"]
4. _IMPORT_TYPES["bash"] = []  (source handled via constructs handler)
5. _CALL_TYPES["bash"] = ["command"]
6. New _extract_bash_constructs() + _bash_get_source_target() helpers
   following the _extract_lua_constructs pattern. Dispatched next to
   the existing Lua/Luau handler.
7. _get_name: new "bash" branch that reads the first `word` child of
   a function_definition (handles both paren and function-keyword forms).
8. _get_call_name: new "bash" branch that reads command_name > word.

Scope caveats documented in PR body
-----------------------------------
* No cross-file call resolution beyond direct source includes (bash has
  no formal module system).
* Calls whose name is a variable expansion ($foo, `$(cmd)`) are not
  statically resolvable and are skipped by _get_call_name.
* Pipelines, subshells, heredocs are parsed by tree-sitter but not
  semantically linked yet.

Tests added (tests/test_multilang.py::TestBashParsing — 10 tests)
-----------------------------------------------------------------
- test_detects_language (.sh, .bash, .zsh)
- test_finds_function_definitions_paren_form
- test_finds_function_definitions_function_keyword_form
- test_finds_source_imports (source + . + quoted string)
- test_finds_calls_between_local_functions
- test_finds_external_command_calls
- test_finds_contains_edges
- test_detects_test_functions
- test_nodes_have_bash_language
- test_calls_inside_functions

New fixture: tests/fixtures/sample.sh exercises both function-definition
forms, both import forms (bareword + quoted), local + external call
mix, and test_* naming.

Test results
------------
Stage 1 (new targeted tests): 10/10 passed.
Stage 2 (tests/test_multilang.py full): 140/140 passed — no regressions
  across any language.
Stage 3 (tests/test_parser.py adjacent): 67/67 passed.
Stage 4 (full suite): 705 passed (up from 699 baseline — +6 new), 6
  pre-existing Windows failures in test_incremental/test_notebook
  (verified identical on unchanged main in PR tirth8205#226 work).
Stage 5 (ruff check): clean.
Stage 6 (fixture smoke parse): 8 nodes, 26 edges. All expected function,
  call, import, and test nodes present.

Zero regressions. All new code lives behind the bash language check so
existing languages are untouched.
@tirth8205
Copy link
Copy Markdown
Owner

Thank you for this @CodeBlackwell — closing as superseded by PR #227 which already shipped Bash/Shell support in v2.3.0 (now on PyPI, along with v2.3.1 that adds the Windows MCP hang fix).

Your PR was in flight at the same time I shipped #227, so this isn't redundant work on your end — the two implementations arrived essentially in parallel. Both approaches are very similar: .sh/.bash/.zsh registered with tree-sitter-bash, function_definition → Function nodes, command → CALLS edges, source/. → IMPORTS_FROM. The shipped version uses _extract_bash_source_command for the import hook and has 7 multilang tests covering the same cases you tested.

If there's anything your implementation does that mine doesn't (e.g. .ksh extension, additional test coverage), please open a follow-up issue pointing at the specific feature and I'll merge it in. The .ksh extension in particular looks worth adding — I didn't include it in #227.

Really appreciate the thorough test breakdown (10/10 targeted, 140/140 multilang, 705 full suite). That's exactly how I'd want a PR tested.

@tirth8205 tirth8205 closed this Apr 11, 2026
azizur100389 added a commit to azizur100389/code-review-graph that referenced this pull request Apr 11, 2026
Register .ksh (Korn shell) with tree-sitter-bash alongside the existing
.sh / .bash / .zsh entries added in tirth8205#227. Korn shell is close enough to
bash syntactically that tree-sitter-bash handles the structural features
the graph captures (function definitions, commands, source/. includes)
correctly.

Context
-------
In the close comment on PR tirth8205#230, @tirth8205 explicitly flagged .ksh as a
missing extension:

    "The .ksh extension in particular looks worth adding — I didn't
     include it in tirth8205#227."

This PR addresses exactly that gap. Issue tirth8205#235 tracks the request.

Why it matters
--------------
Korn shell is still used in legacy AIX/Solaris operations, IBM internal
tooling, and enterprise CI scripts. Repositories that ship .ksh scripts
currently index to 0 nodes because the extension is unrecognized — the
same failure mode that motivated tirth8205#197.

Implementation
--------------
One line added to EXTENSION_TO_LANGUAGE in parser.py:
    ".ksh": "bash"

All of the bash parsing machinery shipped in tirth8205#227 (_FUNCTION_TYPES,
_CALL_TYPES, _extract_bash_source_command, name/call resolution) already
supports any file parsed through the "bash" language path, so no further
changes are needed.

Tests added (tests/test_multilang.py::TestBashParsing)
------------------------------------------------------
1. test_detects_language — extended with a .ksh assertion to lock in
   the extension mapping (regression guard for tirth8205#235).
2. test_ksh_extension_parses_as_bash — end-to-end regression test that
   copies the existing tests/fixtures/sample.sh to a temp .ksh file,
   parses it through the real CodeParser, and asserts:
     - every node's language field is "bash"
     - the set of extracted Function names is identical to the .sh run
     - the CONTAINS / CALLS / IMPORTS_FROM edge counts per kind match
   The second assertion proves the .ksh path is fully wired through to
   the same structural extraction as .sh, not a degenerate zero-result
   read.

Test results
------------
Stage 1 (new targeted tests): 2/2 passed.
Stage 2 (tests/test_multilang.py full): 152/152 passed — zero regressions
  across any language.
Stage 3 (tests/test_parser.py adjacent): 67/67 passed.
Stage 4 (full suite): 733 passed. 8 pre-existing Windows failures in
  test_incremental (3) + test_main async coroutine detection (1) +
  test_notebook Databricks (4) — verified identical on unchanged main.
Stage 5 (ruff check on parser.py and test_multilang.py): clean.
Stage 6 (end-to-end smoke): detect_language("legacy.ksh") -> "bash";
  parsing a real .ksh file produces 6 Function nodes, 18 edges, all
  tagged language=bash.

Zero regressions. Single-line extension mapping change plus a targeted
regression guard against the specific issue the maintainer flagged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Bash/Shell (.sh) parsing for script-heavy repositories

2 participants