forked from langchain-ai/langchain
Commit
Framework for supporting more languages in LanguageParser (langchain-ai#13318)

## Description

I am submitting this for a school project as part of a team of 5. Other team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR also has contributions from community members @Harrolee and @Mario928.

Initial context is in the issue we opened (langchain-ai#11229).

This pull request adds:

- Generic framework for expanding the languages that `LanguageParser` can handle, using the [tree-sitter](https://github.com/tree-sitter/py-tree-sitter#py-tree-sitter) parsing library and existing language-specific parsers written for it
- Support for the following additional languages in `LanguageParser`:
  - C
  - C++
  - C#
  - Go
  - Java (contributed by @Mario928 ThatsJustCheesy#2)
  - Kotlin
  - Lua
  - Perl
  - Ruby
  - Rust
  - Scala
  - TypeScript (contributed by @Harrolee ThatsJustCheesy#1)

Here is the [design document](https://docs.google.com/document/d/17dB14cKCWAaiTeSeBtxHpoVPGKrsPye8W0o_WClz2kk) if curious, but no need to read it.

## Issues

- Closes langchain-ai#11229
- Closes langchain-ai#10996
- Closes langchain-ai#8405

## Dependencies

`tree_sitter` and `tree_sitter_languages` on PyPI. We have tried to add these as optional dependencies.

## Documentation

We have updated the list of supported languages, and also added a section to `source_code.ipynb` detailing how to add support for additional languages using our framework.

## Maintainer

- @hwchase17 (previously reviewed langchain-ai#6486)

Thanks!!

## Git commits

We will gladly squash any/all of our commits (esp merge commits) if necessary. Let us know if this is desirable, or if you will be squash-merging anyway.

---------

Co-authored-by: Maaz Hashmi <mhashmi373@gmail.com>
Co-authored-by: LeilaChr <87657694+LeilaChr@users.noreply.github.com>
Co-authored-by: Jeremy La <jeremylai511@gmail.com>
Co-authored-by: Megabear137 <zubair.alnoor27@gmail.com>
Co-authored-by: Lee Harrold <lhharrold@sep.com>
Co-authored-by: Mario928 <88029051+Mario928@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
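The framework described in the commit message can be pictured as a small set of per-language hooks. The following is a minimal sketch under stated assumptions, not the PR's actual code: `SegmenterSketch`, `get_language_name`, `SEGMENTERS`, and `segmenter_for` are hypothetical names, and the real segmenters return a tree-sitter `Language` object from `get_language` rather than a name string. It uses only the standard library so it runs without `tree_sitter` installed.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class SegmenterSketch(ABC):
    """Hypothetical stand-in for the framework's base class (no tree-sitter needed)."""

    @abstractmethod
    def get_language_name(self) -> str:
        """Name of the grammar a real segmenter would load, e.g. "c"."""

    @abstractmethod
    def get_chunk_query(self) -> str:
        """Tree-sitter query selecting the chunks to extract."""

    @abstractmethod
    def make_line_comment(self, text: str) -> str:
        """Render `text` as a single-line comment in the target language."""


class CSketch(SegmenterSketch):
    """Toy C segmenter: only the three per-language hooks are filled in."""

    def get_language_name(self) -> str:
        return "c"

    def get_chunk_query(self) -> str:
        return "(function_definition) @function"

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"


# Hypothetical registry mapping a language name to its segmenter class.
SEGMENTERS: Dict[str, Type[SegmenterSketch]] = {"c": CSketch}


def segmenter_for(language: str) -> SegmenterSketch:
    """Look up and instantiate the segmenter registered for `language`."""
    try:
        return SEGMENTERS[language]()
    except KeyError:
        raise ValueError(f"No segmenter registered for {language!r}") from None
```

Under this reading, supporting another language amounts to writing one small subclass and registering it, which matches the shape of the per-language files in this commit.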
1 parent 6f20457, commit 6bb7450
Showing 29 changed files with 1,464 additions and 13 deletions.
36 changes: 36 additions & 0 deletions
libs/community/langchain_community/document_loaders/parsers/language/c.py
@@ -0,0 +1,36 @@

```python
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (struct_specifier
            body: (field_declaration_list)) @struct
        (enum_specifier
            body: (enumerator_list)) @enum
        (union_specifier
            body: (field_declaration_list)) @union
        (function_definition) @function
    ]
""".strip()


class CSegmenter(TreeSitterSegmenter):
    """Code segmenter for C."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("c")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
```
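The diff does not show how `make_line_comment` is consumed, so the following is only a guess at its purpose: a sketch, assuming the base class replaces each extracted chunk with a one-line placeholder comment when producing a simplified view of the file. The `simplify` function, the `spans` argument, and the `Code for:` wording are all hypothetical; in the real framework the line spans would presumably come from tree-sitter captures of `CHUNK_QUERY`.

```python
from typing import List, Optional, Tuple


def make_line_comment(text: str) -> str:
    # Same format as CSegmenter.make_line_comment in the diff above.
    return f"// {text}"


def simplify(source: str, spans: List[Tuple[int, int]]) -> str:
    """Replace each (start, end) line span (0-based, inclusive) with one comment line.

    Hypothetical helper: the spans are supplied by the caller here, whereas a
    tree-sitter based segmenter would derive them from query captures.
    """
    original = source.splitlines()
    lines: List[Optional[str]] = list(original)
    for start, end in spans:
        # Keep a placeholder that quotes the first line of the chunk.
        lines[start] = make_line_comment(f"Code for: {original[start]}")
        for i in range(start + 1, end + 1):
            lines[i] = None  # drop the rest of the chunk body
    return "\n".join(line for line in lines if line is not None)


source = "#include <stdio.h>\n\nint main(void) {\n    return 0;\n}\n"
simplified = simplify(source, [(2, 4)])
print(simplified)
```

With the `main` function occupying lines 2 through 4, the include line and the blank line survive and the function body collapses to a single `//` comment.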
36 changes: 36 additions & 0 deletions
libs/community/langchain_community/document_loaders/parsers/language/cpp.py
@@ -0,0 +1,36 @@

```python
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (class_specifier
            body: (field_declaration_list)) @class
        (struct_specifier
            body: (field_declaration_list)) @struct
        (union_specifier
            body: (field_declaration_list)) @union
        (function_definition) @function
    ]
""".strip()


class CPPSegmenter(TreeSitterSegmenter):
    """Code segmenter for C++."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("cpp")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
```
36 changes: 36 additions & 0 deletions
libs/community/langchain_community/document_loaders/parsers/language/csharp.py
@@ -0,0 +1,36 @@

```python
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (namespace_declaration) @namespace
        (class_declaration) @class
        (method_declaration) @method
        (interface_declaration) @interface
        (enum_declaration) @enum
        (struct_declaration) @struct
        (record_declaration) @record
    ]
""".strip()


class CSharpSegmenter(TreeSitterSegmenter):
    """Code segmenter for C#."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("c_sharp")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
```
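The C# query matches node types that nest (a method sits inside a class, which may sit inside a namespace), so a consumer of the captures presumably has to avoid emitting the same lines more than once. The sketch below shows one way that could be done, assuming captures arrive in document order so that an enclosing node comes before the nodes it contains. The `Capture` tuple and the `outermost` function are hypothetical and are not taken from the PR.

```python
from typing import List, Set, Tuple

# Hypothetical capture record: (capture name, start line, end line), 0-based inclusive.
Capture = Tuple[str, int, int]


def outermost(captures: List[Capture]) -> List[Capture]:
    """Keep a capture only if none of its lines were claimed by an earlier capture."""
    claimed: Set[int] = set()
    kept: List[Capture] = []
    for name, start, end in captures:
        span = range(start, end + 1)
        if any(line in claimed for line in span):
            continue  # nested inside, or overlapping, a chunk that was already kept
        claimed.update(span)
        kept.append((name, start, end))
    return kept


captures = [
    ("namespace", 0, 9),
    ("class", 1, 8),
    ("method", 2, 4),
    ("enum", 11, 13),
]
print(outermost(captures))
```

Here the class and method are dropped because the enclosing namespace already covers their lines, while the separate enum is kept as its own chunk.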
31 changes: 31 additions & 0 deletions
libs/community/langchain_community/document_loaders/parsers/language/go.py
@@ -0,0 +1,31 @@

```python
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (function_declaration) @function
        (type_declaration) @type
    ]
""".strip()


class GoSegmenter(TreeSitterSegmenter):
    """Code segmenter for Go."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("go")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
```
32 changes: 32 additions & 0 deletions
libs/community/langchain_community/document_loaders/parsers/language/java.py
@@ -0,0 +1,32 @@

```python
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (class_declaration) @class
        (interface_declaration) @interface
        (enum_declaration) @enum
    ]
""".strip()


class JavaSegmenter(TreeSitterSegmenter):
    """Code segmenter for Java."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("java")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
```
31 changes: 31 additions & 0 deletions
libs/community/langchain_community/document_loaders/parsers/language/kotlin.py
@@ -0,0 +1,31 @@

```python
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (function_declaration) @function
        (class_declaration) @class
    ]
""".strip()


class KotlinSegmenter(TreeSitterSegmenter):
    """Code segmenter for Kotlin."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("kotlin")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
```