Framework for supporting more languages in LanguageParser (langchain-…
…ai#13318)

## Description

I am submitting this for a school project as part of a team of 5. Other
team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR
also has contributions from community members @Harrolee and @Mario928.

Initial context is in the issue we opened (langchain-ai#11229).

This pull request adds:

- Generic framework for expanding the languages that `LanguageParser`
can handle, using the
[tree-sitter](https://github.com/tree-sitter/py-tree-sitter#py-tree-sitter)
parsing library and existing language-specific parsers written for it
- Support for the following additional languages in `LanguageParser`:
  - C
  - C++
  - C#
  - Go
  - Java (contributed by @Mario928, ThatsJustCheesy#2)
  - Kotlin
  - Lua
  - Perl
  - Ruby
  - Rust
  - Scala
  - TypeScript (contributed by @Harrolee, ThatsJustCheesy#1)

Here is the [design
document](https://docs.google.com/document/d/17dB14cKCWAaiTeSeBtxHpoVPGKrsPye8W0o_WClz2kk)
if curious, but no need to read it.
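
For orientation, here is a minimal usage sketch of what the added language support enables. It assumes `langchain_community`, `tree_sitter`, and `tree_sitter_languages` are installed, and that `LanguageParser` accepts the string `"c"` as its `language` argument. The helper name `load_c_documents` and the sample source are hypothetical and not part of this PR.

```python
# Hypothetical C source used only for illustration.
SAMPLE_C = """
struct Point { int x; int y; };

int add(int a, int b) {
    return a + b;
}
"""


def load_c_documents(code: str = SAMPLE_C):
    """Split C source into documents with LanguageParser (sketch)."""
    # Imports are deferred so this function can be defined even when the
    # optional packages are not installed.
    from langchain_community.document_loaders.blob_loaders import Blob
    from langchain_community.document_loaders.parsers import LanguageParser

    # Assumption: "c" is the identifier registered for C in LanguageParser.
    parser = LanguageParser(language="c", parser_threshold=0)
    blob = Blob.from_data(code, path="example.c")
    return list(parser.lazy_parse(blob))


if __name__ == "__main__":
    for doc in load_c_documents():
        print(doc.metadata, doc.page_content, sep="\n", end="\n\n")
```

Per the notebook text updated in this PR, each top-level function or class should end up in its own document, with the remaining top-level code in a separate one.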

## Issues

- Closes langchain-ai#11229
- Closes langchain-ai#10996
- Closes langchain-ai#8405

## Dependencies

`tree_sitter` and `tree_sitter_languages` on PyPI. We have tried to add
these as optional dependencies.
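
Since these dependencies are optional, callers may want to check for them before selecting a tree-sitter-backed language. The following is a minimal sketch using only the standard library; the helper `missing_packages` and the constant `TREE_SITTER_PACKAGES` are hypothetical names, not part of this PR.

```python
from importlib.util import find_spec
from typing import Iterable, List

# The two packages this PR names as optional dependencies.
TREE_SITTER_PACKAGES = ("tree_sitter", "tree_sitter_languages")


def missing_packages(names: Iterable[str] = TREE_SITTER_PACKAGES) -> List[str]:
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("tree-sitter support is available")
```

`find_spec` only looks the package up and does not import it, so the check itself has no side effects from the optional packages.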

## Documentation

We have updated the list of supported languages, and also added a
section to `source_code.ipynb` detailing how to add support for
additional languages using our framework.
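
As a rough sketch of what adding a language looks like under this framework, the example below mirrors the shape of the segmenters in this PR (`get_language`, `get_chunk_query`, `make_line_comment`). To stay self-contained it uses `SegmenterBase`, a simplified stand-in for the PR's `TreeSitterSegmenter`. The PHP target, the class name `PHPSegmenter`, and the node names in the query are assumptions for illustration and are not part of this PR.

```python
from abc import ABC, abstractmethod
from typing import Any


class SegmenterBase(ABC):
    """Simplified stand-in for the PR's TreeSitterSegmenter (hypothetical)."""

    @abstractmethod
    def get_language(self) -> Any:
        """Return the tree-sitter Language object for this grammar."""

    @abstractmethod
    def get_chunk_query(self) -> str:
        """Return the tree-sitter query that selects top-level chunks."""

    @abstractmethod
    def make_line_comment(self, text: str) -> str:
        """Format text as a single-line comment in the target language."""


# Assumed node names for the PHP grammar; verify them against the grammar
# before relying on this query.
CHUNK_QUERY = """
    [
        (function_definition) @function
        (class_declaration) @class
    ]
""".strip()


class PHPSegmenter(SegmenterBase):
    """Code segmenter sketch for PHP (not part of this PR)."""

    def get_language(self) -> Any:
        # Deferred import, so the optional dependency is only needed on use.
        from tree_sitter_languages import get_language

        return get_language("php")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
```

In the real framework the class would subclass `TreeSitterSegmenter` and be registered in `language_parser.py`, as the notebook section added by this PR describes.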

## Maintainer

- @hwchase17 (previously reviewed
langchain-ai#6486)

Thanks!!

## Git commits

We will gladly squash any/all of our commits (esp merge commits) if
necessary. Let us know if this is desirable, or if you will be
squash-merging anyway.


---------

Co-authored-by: Maaz Hashmi <mhashmi373@gmail.com>
Co-authored-by: LeilaChr <87657694+LeilaChr@users.noreply.github.com>
Co-authored-by: Jeremy La <jeremylai511@gmail.com>
Co-authored-by: Megabear137 <zubair.alnoor27@gmail.com>
Co-authored-by: Lee Harrold <lhharrold@sep.com>
Co-authored-by: Mario928 <88029051+Mario928@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
9 people authored and snsten committed Feb 15, 2024
1 parent 6f20457 commit 6bb7450
Showing 29 changed files with 1,464 additions and 13 deletions.
61 changes: 58 additions & 3 deletions docs/docs/integrations/document_loaders/source_code.ipynb
Original file line number Diff line number Diff line change
@@ -9,7 +9,35 @@
"\n",
"This notebook covers how to load source code files using a special approach with language parsing: each top-level function and class in the code is loaded into its own document. Any remaining top-level code outside the already loaded functions and classes will be loaded into a separate document.\n",
"\n",
"This approach can potentially improve the accuracy of QA models over source code. Currently, the supported languages for code parsing are Python and JavaScript. The language used for parsing can be configured, along with the minimum number of lines required to activate the splitting based on syntax."
"This approach can potentially improve the accuracy of QA models over source code.\n",
"\n",
"The supported languages for code parsing are:\n",
"\n",
"- C (*)\n",
"- C++ (*)\n",
"- C# (*)\n",
"- COBOL\n",
"- Go (*)\n",
"- Java (*)\n",
"- JavaScript (requires package `esprima`)\n",
"- Kotlin (*)\n",
"- Lua (*)\n",
"- Perl (*)\n",
"- Python\n",
"- Ruby (*)\n",
"- Rust (*)\n",
"- Scala (*)\n",
"- TypeScript (*)\n",
"\n",
"Items marked with (*) require the packages `tree_sitter` and `tree_sitter_languages`.\n",
"It is straightforward to add support for additional languages using `tree_sitter`,\n",
"although this currently requires modifying LangChain.\n",
"\n",
"The language used for parsing can be configured, along with the minimum number of\n",
"lines required to activate the splitting based on syntax.\n",
"\n",
"If a language is not explicitly specified, `LanguageParser` will infer one from\n",
"filename extensions, if present."
]
},
{
@@ -19,7 +47,7 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet esprima"
"%pip install -qU esprima tree_sitter tree_sitter_languages"
]
},
{
@@ -395,6 +423,33 @@
"source": [
"print(\"\\n\\n--8<--\\n\\n\".join([document.page_content for document in result]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adding Languages using Tree-sitter Template\n",
"\n",
"Expanding language support using the Tree-sitter template involves a few essential steps:\n",
"\n",
"1. **Creating a New Language File**:\n",
" - Begin by creating a new file in the designated directory (langchain/libs/community/langchain_community/document_loaders/parsers/language).\n",
" - Model this file on the structure and parsing logic of existing language files like **`cpp.py`**.\n",
" - You will also need to create a file in the langchain directory (langchain/libs/langchain/langchain/document_loaders/parsers/language).\n",
"2. **Parsing Language Specifics**:\n",
" - Mimic the structure used in the **`cpp.py`** file, adapting it to suit the language you are incorporating.\n",
" - The primary alteration involves adjusting the chunk query array to suit the syntax and structure of the language you are parsing.\n",
"3. **Testing the Language Parser**:\n",
" - For thorough validation, generate a test file specific to the new language. Create **`test_language.py`** in the designated directory (langchain/libs/community/tests/unit_tests/document_loaders/parsers/language).\n",
" - Follow the example set by **`test_cpp.py`** to establish fundamental tests for the parsed elements in the new language.\n",
"4. **Integration into the Parser and Text Splitter**:\n",
" - Incorporate your new language within the **`language_parser.py`** file. Be sure to update `LANGUAGE_EXTENSIONS` and `LANGUAGE_SEGMENTERS`, along with the docstring for `LanguageParser`, so that the added language is recognized and handled.\n",
" - Also, confirm that your language is included in **`text_splitter.py`** in class `Language` for proper parsing.\n",
"\n",
"By following these steps and ensuring comprehensive testing and integration, you'll successfully extend language support using the Tree-sitter template.\n",
"\n",
"Best of luck!"
]
}
],
"metadata": {
@@ -413,7 +468,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.11.5"
}
},
"nbformat": 4,
@@ -0,0 +1,36 @@
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (struct_specifier
            body: (field_declaration_list)) @struct
        (enum_specifier
            body: (enumerator_list)) @enum
        (union_specifier
            body: (field_declaration_list)) @union
        (function_definition) @function
    ]
""".strip()


class CSegmenter(TreeSitterSegmenter):
    """Code segmenter for C."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("c")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
@@ -0,0 +1,36 @@
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (class_specifier
            body: (field_declaration_list)) @class
        (struct_specifier
            body: (field_declaration_list)) @struct
        (union_specifier
            body: (field_declaration_list)) @union
        (function_definition) @function
    ]
""".strip()


class CPPSegmenter(TreeSitterSegmenter):
    """Code segmenter for C++."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("cpp")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
@@ -0,0 +1,36 @@
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (namespace_declaration) @namespace
        (class_declaration) @class
        (method_declaration) @method
        (interface_declaration) @interface
        (enum_declaration) @enum
        (struct_declaration) @struct
        (record_declaration) @record
    ]
""".strip()


class CSharpSegmenter(TreeSitterSegmenter):
    """Code segmenter for C#."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("c_sharp")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
@@ -0,0 +1,31 @@
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (function_declaration) @function
        (type_declaration) @type
    ]
""".strip()


class GoSegmenter(TreeSitterSegmenter):
    """Code segmenter for Go."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("go")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
@@ -0,0 +1,32 @@
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (class_declaration) @class
        (interface_declaration) @interface
        (enum_declaration) @enum
    ]
""".strip()


class JavaSegmenter(TreeSitterSegmenter):
    """Code segmenter for Java."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("java")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"
@@ -0,0 +1,31 @@
from typing import TYPE_CHECKING

from langchain_community.document_loaders.parsers.language.tree_sitter_segmenter import (  # noqa: E501
    TreeSitterSegmenter,
)

if TYPE_CHECKING:
    from tree_sitter import Language


CHUNK_QUERY = """
    [
        (function_declaration) @function
        (class_declaration) @class
    ]
""".strip()


class KotlinSegmenter(TreeSitterSegmenter):
    """Code segmenter for Kotlin."""

    def get_language(self) -> "Language":
        from tree_sitter_languages import get_language

        return get_language("kotlin")

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"