# **Example**: Splitting a Python Source File into Chunks with `CodeSplitter`

Suppose you have a Python code file and want to split it into chunks that respect function and class boundaries (rather than just splitting every N characters). The [**`CodeSplitter`**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#codesplitter) leverages [LangChain's RecursiveCharacterTextSplitter](https://python.langchain.com/docs/how_to/code_splitter/) to achieve this, making it ideal for preparing code for LLM ingestion, code review, or annotation.

![Programming languages](https://bairesdev.mo.cloudinary.net/blog/2020/10/top-programming-languages.png?tx=w_1920,q_auto)

---

## Step 1: Read the Python Source File

We will use the [**`VanillaReader`**](https://andreshere00.github.io/Splitter_MR/api_reference/reader/#vanillareader) to load our code file. You can provide a local file path (or a URL if your implementation supports it).

!!! Note
    If you use [`MarkItDownReader`](https://andreshere00.github.io/Splitter_MR/api_reference/reader/#markitdownreader) or [`DoclingReader`](https://andreshere00.github.io/Splitter_MR/api_reference/reader/#doclingreader), save your files in `txt` format.

```python
from splitter_mr.reader import VanillaReader

reader = VanillaReader()
reader_output = reader.read(
    "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/code_example.py"
)
```


The [`reader_output`](https://andreshere00.github.io/Splitter_MR/api_reference/reader/#splitter_mr.schema.models.ReaderOutput) is an object containing the raw code and its metadata:

```python
print(reader_output.model_dump_json(indent=4))
```

Output (truncated):

```text
{
    "text": "from langchain_text_splitters import Language, RecursiveCharacterTextSplitter\n\nfrom ...schema import ReaderOutput, SplitterOutput\nfrom ..base_splitter import BaseSplitter\n\n\ndef get_langchain_language(lang_str: str) -> Language:\n    \"\"\"\n    Map a string language name to Langchain Language enum.\n    Raises ValueError if not found.\n    \"\"\"\n    lookup = {lang.name.lower(): lang for lang in Language}\n    key = lang_str.lower()\n    if key not in lookup:\n        raise ValueError(\n            f\"Unsupported language '{lang_str}'. Supported: {list(lookup.keys())}\"\n        )\n    return lookup[key]\n\n\nclass CodeSplitter(BaseSplitter):\n    \"\"\"\n    CodeSplitter recursively splits source code into programmatically meaningful chunks\n    (functions, classes, methods, etc.) for the given programming language.\n\n    Args:\n        chunk_size (int): Maximum chunk size, in characters.\n        language (str): Programming language (e.g., \"python\", \"java\",
```


To see the code content:

```python
print(reader_output.text)
```

Output (truncated):

```text
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

from ...schema import ReaderOutput, SplitterOutput
from ..base_splitter import BaseSplitter


def get_langchain_language(lang_str: str) -> Language:
    """
    Map a string language name to Langchain Language enum.
    Raises ValueError if not found.
    """
    lookup = {lang.name.lower(): lang for lang in Language}
    key = lang_str.lower()
    if key not in lookup:
        raise ValueError(
            f"Unsupported language '{lang_str}'. Supported: {list(lookup.keys())}"
        )
    return lookup[key]


class CodeSplitter(BaseSplitter):
    """
    CodeSplitter recursively splits source code into programmatically meaningful chunks
    (functions, classes, methods, etc.) for the given programming language.

    Args:
        chunk_size (int): Maximum chunk size, in characters.
        language (str): Programming language (e.g., "python", "java", "kotlin", etc.)

    Notes:
        - Uses Langchain's R
```
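Before splitting, it can help to know which top-level functions and classes a file contains, since those are the boundaries you would expect the chunks to follow. The sketch below uses only Python's standard `ast` module. The helper name `top_level_definitions` and the short `source` string are hypothetical stand-ins; in this tutorial you would pass `reader_output.text` instead.

```python
import ast
from typing import List, Tuple


def top_level_definitions(source: str) -> List[Tuple[str, str, int]]:
    """Return (kind, name, line number) for each top-level function or class."""
    tree = ast.parse(source)
    found: List[Tuple[str, str, int]] = []
    for node in tree.body:  # only top-level statements, not nested ones
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found.append(("function", node.name, node.lineno))
        elif isinstance(node, ast.ClassDef):
            found.append(("class", node.name, node.lineno))
    return found


# Hypothetical stand-in for reader_output.text
source = (
    "def get_langchain_language(lang_str):\n"
    "    return lang_str.lower()\n"
    "\n"
    "\n"
    "class CodeSplitter:\n"
    "    pass\n"
)

for kind, name, lineno in top_level_definitions(source):
    print(f"{kind:<8} {name} (line {lineno})")
```

Because `ast.parse` only checks syntax, it should also work on files whose imports cannot be resolved locally.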


---

## Step 2: Chunk the Code Using `CodeSplitter`

To split your code by language-aware logical units, instantiate the `CodeSplitter`, specifying the `chunk_size` (maximum number of characters per chunk) and `language` (e.g., `"python"`):

```python
from splitter_mr.splitter import CodeSplitter

splitter = CodeSplitter(chunk_size=1000, language="python")
splitter_output = splitter.split(reader_output)
```

The `splitter_output` is a `SplitterOutput` object that contains the split code chunks:

```python
print(splitter_output)
```

Output (truncated):

```text
chunks=['from langchain_text_splitters import Language, RecursiveCharacterTextSplitter\n\nfrom ...schema import ReaderOutput, SplitterOutput\nfrom ..base_splitter import BaseSplitter\n\n\ndef get_langchain_language(lang_str: str) -> Language:\n    """\n    Map a string language name to Langchain Language enum.\n    Raises ValueError if not found.\n    """\n    lookup = {lang.name.lower(): lang for lang in Language}\n    key = lang_str.lower()\n    if key not in lookup:\n        raise ValueError(\n            f"Unsupported language \'{lang_str}\'. Supported: {list(lookup.keys())}"\n        )\n    return lookup[key]', 'class CodeSplitter(BaseSplitter):\n    """\n    CodeSplitter recursively splits source code into programmatically meaningful chunks\n    (functions, classes, methods, etc.) for the given programming language.\n\n    Args:\n        chunk_size (int): Maximum chunk size, in characters.\n        language (str): Programming language (e.g., "python", "java", "kotlin", etc.)\n\n 
```


To inspect the split results, iterate over the chunks and print them:

```python
for idx, chunk in enumerate(splitter_output.chunks):
    print("=" * 40 + f" Chunk {idx} " + "=" * 40)
    print(chunk)
```

Output (truncated):

```text
======================================== Chunk 0 ========================================
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

from ...schema import ReaderOutput, SplitterOutput
from ..base_splitter import BaseSplitter


def get_langchain_language(lang_str: str) -> Language:
    """
    Map a string language name to Langchain Language enum.
    Raises ValueError if not found.
    """
    lookup = {lang.name.lower(): lang for lang in Language}
    key = lang_str.lower()
    if key not in lookup:
        raise ValueError(
            f"Unsupported language '{lang_str}'. Supported: {list(lookup.keys())}"
        )
    return lookup[key]
======================================== Chunk 1 ========================================
class CodeSplitter(BaseSplitter):
    """
    CodeSplitter recursively splits source code into programmatically meaningful chunks
    (functions, classes, methods, etc.) for the given programming language.

    Args:
        chunk_size (int): Maximum chunk size, in characters.
        language (str): Programming language (e.g., "python", "java", "kotlin", etc.)

    Notes:
        - Uses Langchain's Rec
```
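When a file produces many chunks, printing each one in full becomes hard to scan. As a small sketch, the helper below reports each chunk's index, length, first line, and whether it fits within the size limit. The name `summarize_chunks` and the `example_chunks` list are hypothetical stand-ins; in this tutorial you would pass `splitter_output.chunks` and the `chunk_size` given to `CodeSplitter`.

```python
from typing import Dict, List, Union


def summarize_chunks(chunks: List[str], chunk_size: int) -> List[Dict[str, Union[int, str, bool]]]:
    """Return index, length, first line, and a size check for each chunk."""
    summary = []
    for idx, chunk in enumerate(chunks):
        stripped = chunk.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        summary.append(
            {
                "index": idx,
                "length": len(chunk),
                "first_line": first_line,
                "fits": len(chunk) <= chunk_size,
            }
        )
    return summary


# Hypothetical stand-in for splitter_output.chunks
example_chunks = [
    "def get_langchain_language(lang_str):\n    ...",
    "class CodeSplitter(BaseSplitter):\n    ...",
]

for row in summarize_chunks(example_chunks, chunk_size=1000):
    print(row)
```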


**And that's it!** You now have an efficient, language-aware way to chunk your code files for downstream tasks. 

`CodeSplitter` is not limited to Python: other languages such as **JavaScript, Go, Rust, and Java** are also available. The currently supported languages, listed by file extension, are:

```python
from typing import Set

SUPPORTED_PROGRAMMING_LANGUAGES: Set[str] = {
    "lua",
    "java",
    "ts",
    "tsx",
    "ps1",
    "psm1",
    "psd1",
    "ps1xml",
    "php",
    "php3",
    "php4",
    "php5",
    "phps",
    "phtml",
    "rs",
    "cs",
    "csx",
    "cob",
    "cbl",
    "hs",
    "scala",
    "swift",
    "tex",
    "rb",
    "erb",
    "kt",
    "kts",
    "go",
    "html",
    "htm",
    "rst",
    "ex",
    "exs",
    "md",
    "markdown",
    "proto",
    "sol",
    "c",
    "h",
    "cpp",
    "cc",
    "cxx",
    "c++",
    "hpp",
    "hh",
    "hxx",
    "js",
    "mjs",
    "py",
    "pyw",
    "pyc",
    "pyo",
    "pl",
    "pm",
}
```


!!! Note

    See the [LangChain code splitter documentation](https://python.langchain.com/docs/how_to/code_splitter/) for up-to-date information about the programming languages you can split on.
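Since the entries in the set above are file extensions, one way to use such a set is to check a file's extension before deciding to treat it as code. The sketch below is a minimal illustration. The helper `is_supported` and the reduced `SUPPORTED_EXTENSIONS` set are hypothetical and repeat only a few of the listed entries so the example is self-contained.

```python
from pathlib import Path
from typing import Set

# A small subset of the extensions listed above (for illustration only)
SUPPORTED_EXTENSIONS: Set[str] = {"py", "js", "ts", "go", "rs", "java", "md"}


def is_supported(path: str, supported: Set[str] = SUPPORTED_EXTENSIONS) -> bool:
    """Return True if the file extension (no dot, lowercased) is in `supported`."""
    extension = Path(path).suffix.lstrip(".").lower()
    return extension in supported


for name in ["src/app.py", "main.GO", "notes.docx", "Makefile"]:
    print(name, is_supported(name))
```

Note that the `language` argument shown earlier takes a language name such as `"python"`, not an extension, so a mapping from extension to language name would still be needed. That mapping is not shown here.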

## Complete Script

Here is a full example you can run directly:

```python
from splitter_mr.reader import VanillaReader
from splitter_mr.splitter import CodeSplitter

# Step 1: Read the code file
reader = VanillaReader()
reader_output = reader.read("https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/code_example.py")

print(reader_output.model_dump_json(indent=4))  # See metadata
print(reader_output.text)  # See raw code

# Step 2: Split code into logical chunks, max 1000 chars per chunk
splitter = CodeSplitter(chunk_size=1000, language="python")
splitter_output = splitter.split(reader_output)

print(splitter_output)  # Print the SplitterOutput object

# Step 3: Visualize code chunks
for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {idx} " + "="*40)
    print(chunk)
```


### References

[LangChain's RecursiveCharacterTextSplitter](https://python.langchain.com/docs/how_to/code_splitter/) 