# RecursiveJsonSplitter

- Author: [HeeWung Song(Dan)](https://github.com/heewungsong)
- Design: 
- Peer Review:
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)

## Overview

This JSON splitter generates smaller JSON chunks by performing a depth-first traversal of JSON data.

While this splitter tries to keep nested JSON objects intact, it splits them when necessary to keep chunk sizes between `min_chunk_size` and `max_chunk_size`. When a value is a very large string rather than nested JSON, that string is not split, so the resulting chunk can exceed `max_chunk_size`.
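The traversal described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `split_depth_first`, `json_size`, and `set_nested` are hypothetical helpers, `min_chunk_size` is ignored, and size is measured as the length of the serialized JSON string.

```python
import json
from typing import Any


def json_size(data: Any) -> int:
    """Size of a value, measured as the length of its serialized JSON string."""
    return len(json.dumps(data))


def set_nested(target: dict, path: list[str], value: Any) -> None:
    """Store value under the nested key path, creating parent dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_depth_first(data: dict, max_chunk_size: int) -> list[dict]:
    """Hypothetical helper: depth-first chunking of a nested dictionary."""
    chunks: list[dict] = [{}]

    def visit(node: Any, path: list[str]) -> None:
        if isinstance(node, dict) and node:
            for key, value in node.items():
                size = json_size({key: value})
                remaining = max_chunk_size - json_size(chunks[-1])
                if size < remaining:
                    # The whole subtree fits in the current chunk.
                    set_nested(chunks[-1], path + [key], value)
                else:
                    # Start a new chunk (if needed) and descend into the value.
                    if chunks[-1]:
                        chunks.append({})
                    visit(value, path + [key])
        elif path:
            # Leaf values (strings, numbers, lists) are stored whole, never split.
            set_nested(chunks[-1], path, node)

    visit(data, [])
    return [chunk for chunk in chunks if chunk]


sample = {
    "info": {"title": "LangSmith", "version": "0.1.0"},
    "tags": ["a", "b", "c"],
    "note": "x" * 50,
}
chunks = split_depth_first(sample, max_chunk_size=40)
for chunk in chunks:
    print(json_size(chunk), chunk)
```

In this sketch the `info` object is split across two chunks, while the 50-character `note` string stays in one chunk that is larger than the limit. Each chunk keeps the full key path to its values.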

If strict chunk size limits are required, consider passing the chunks produced by this splitter through a recursive text splitter afterwards.
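As a rough picture of such a second pass, the sketch below cuts any oversized chunk string into fixed-width pieces. It is a plain-Python stand-in rather than a real recursive text splitter (in LangChain you would more likely reach for a class such as `RecursiveCharacterTextSplitter`), and `enforce_max_length` is a hypothetical helper. Note that the pieces of a cut chunk are no longer valid JSON on their own.

```python
def enforce_max_length(chunks: list[str], max_chunk_size: int) -> list[str]:
    """Hypothetical helper: cut any oversized string into fixed-width pieces."""
    capped: list[str] = []
    for chunk in chunks:
        if len(chunk) <= max_chunk_size:
            # Chunks within the limit pass through untouched.
            capped.append(chunk)
            continue
        # Oversized chunks are sliced every max_chunk_size characters.
        for start in range(0, len(chunk), max_chunk_size):
            capped.append(chunk[start : start + max_chunk_size])
    return capped


sample_chunks = ['{"a": 1}', '{"blob": "' + "x" * 40 + '"}']
capped_chunks = enforce_max_length(sample_chunks, max_chunk_size=20)
print([len(piece) for piece in capped_chunks])
```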

**Splitting Criteria**

1. Text Splitting Method: Based on JSON values
2. Chunk Size Measurement: Based on character count


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Basic JSON Splitting](#basic-json-splitting)
- [Handling JSON Structure](#handling-json-structure)


### References

- [LangChain RecursiveJsonSplitter](https://python.langchain.com/api_reference/text_splitters/json/langchain_text_splitters.json.RecursiveJsonSplitter.html#langchain_text_splitters.json.RecursiveJsonSplitter)
- [LangChain: How to split JSON data](https://python.langchain.com/docs/how_to/recursive_json_splitter/)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [None]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [None]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ]
)

In [None]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "RecursiveJsonSplitter",
    }
)

Alternatively, you can set `OPENAI_API_KEY` in a `.env` file and load it.

[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps.

In [None]:
from dotenv import load_dotenv

load_dotenv()

True

## Basic JSON Splitting

Let's explore the basic methods of splitting JSON data using `RecursiveJsonSplitter`.

- JSON data preparation
- `RecursiveJsonSplitter` configuration
- Three splitting methods (`split_json`, `create_documents`, `split_text`)
- Chunk size verification

In [1]:
import requests

# Load the JSON data.
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

In [None]:
json_data

This is an example of splitting JSON data using `RecursiveJsonSplitter`.

In [3]:
from langchain_text_splitters import RecursiveJsonSplitter

# Create a RecursiveJsonSplitter object that splits JSON data into chunks with a maximum size of 300
splitter = RecursiveJsonSplitter(max_chunk_size=300)

Use the `splitter.split_json()` method to recursively split JSON data.

In [4]:
# Recursively split JSON data. Use this when you need to access or manipulate small JSON fragments.
json_chunks = splitter.split_json(json_data=json_data)

- Use the `splitter.create_documents()` method to convert JSON data into `Document` objects.
- Use the `splitter.split_text()` method to split JSON data into a list of strings.

In [5]:
# Create documents based on JSON data.
docs = splitter.create_documents(texts=[json_data])

# Create string chunks based on JSON data.
texts = splitter.split_text(json_data=json_data)

# Print the content of the first document.
print(docs[0].page_content)

print("===" * 20)

# Print the first string chunk.
print(texts[0])

{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}
{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}
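The identical output above suggests that the three methods differ mainly in their return type. The sketch below illustrates that relationship with the standard library only. It is an assumption-based picture, not the library code: `SimpleDocument` is a hypothetical stand-in for LangChain's `Document`, and the sample dictionaries are made up for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document class."""

    page_content: str
    metadata: dict = field(default_factory=dict)


# Pretend these dictionaries came back from split_json().
dict_chunks = [
    {"openapi": "3.1.0", "info": {"title": "LangSmith"}},
    {"paths": {"/example": {"get": {"summary": "Example endpoint"}}}},
]

# split_text()-style output: one serialized JSON string per dictionary chunk.
text_chunks = [json.dumps(chunk) for chunk in dict_chunks]

# create_documents()-style output: each string wrapped in a document object.
doc_chunks = [SimpleDocument(page_content=text) for text in text_chunks]

print(type(dict_chunks[0]).__name__, type(text_chunks[0]).__name__)
print(doc_chunks[0].page_content == text_chunks[0])
```

Dictionaries are convenient when you want to keep manipulating the data, strings when you only need text, and documents when the chunks feed into other LangChain components.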


## Handling JSON Structure

Let's explore how `RecursiveJsonSplitter` handles different JSON structures and where its limitations lie.

- List object size verification
- JSON structure parsing
- Using the `convert_lists` parameter for list transformation

By examining `texts[2]`, which is one of the larger chunks, we can confirm that it contains a list object.

- The chunk at index 2 exceeds the size limit (300) because it contains a list object.
- `RecursiveJsonSplitter` does not split list objects, so the whole list stays in a single chunk.

In [6]:
# Let's check the size of the chunks
print([len(text) for text in texts][:10])

# When examining one of the larger chunks, we can see that it contains a list object
print(texts[2])

[232, 197, 469, 210, 213, 237, 271, 191, 232, 215]
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}


You can parse the chunk at index 2 using the json module as follows:

In [7]:
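import json

# Parse the chunk string back into a Python dictionary.
# Note: this rebinds `json_data`, so it now holds only this chunk, not the full document.
json_data = json.loads(texts[2])
json_data["paths"]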

{'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id',
     'in': 'path',
     'required': True,
     'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}},
    {'name': 'include_stats',
     'in': 'query',
     'required': False,
     'schema': {'type': 'boolean',
      'default': False,
      'title': 'Include Stats'}},
    {'name': 'accept',
     'in': 'header',
     'required': False,
     'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
      'title': 'Accept'}}]}}}
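To check this kind of thing across all chunks rather than one at a time, a small helper can report which chunks exceed the limit and whether each of them contains a list. This is a minimal sketch: `contains_list` and `oversized_report` are hypothetical helpers, and the sample strings are made up for illustration.

```python
import json
from typing import Any


def contains_list(node: Any) -> bool:
    """Return True if a list appears anywhere inside the parsed JSON value."""
    if isinstance(node, list):
        return True
    if isinstance(node, dict):
        return any(contains_list(value) for value in node.values())
    return False


def oversized_report(texts: list[str], max_chunk_size: int) -> list[tuple[int, int, bool]]:
    """Hypothetical helper: (index, length, has_list) for each chunk over the limit."""
    return [
        (index, len(text), contains_list(json.loads(text)))
        for index, text in enumerate(texts)
        if len(text) > max_chunk_size
    ]


sample_texts = [
    '{"info": {"title": "LangSmith"}}',
    '{"parameters": [{"name": "session_id"}, {"name": "include_stats"}]}',
    '{"description": "' + "x" * 60 + '"}',
]
report = oversized_report(sample_texts, max_chunk_size=50)
print(report)
```

In the sample, the second chunk is oversized because of a list, while the third is oversized because of a long string, which is the other case the splitter leaves intact.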

You can convert lists in JSON to `key:value` pairs in the form of `index:item` by setting the `convert_lists` parameter to `True`.

In [27]:
# Preprocess the JSON so that lists become dictionaries with index:item as key:value pairs, then split.
# Here `json_data` is the single chunk parsed in the previous cell, not the full OpenAPI document.
texts = splitter.split_text(json_data=json_data, convert_lists=True)

In [28]:
# Check the result: the list has been converted to a dictionary.
print(texts[2])

{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": {"2": {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": {"0": {"type": "string"}, "1": {"type": "null"}}, "title": "Accept"}}}}}}}
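The `index:item` conversion seen in the output above can be pictured with a short recursive function. This is a sketch of the idea only, and `list_to_dict` is a hypothetical name rather than part of the library's public API.

```python
from typing import Any


def list_to_dict(node: Any) -> Any:
    """Recursively replace every list with a dict keyed by the item index (as a string)."""
    if isinstance(node, dict):
        return {key: list_to_dict(value) for key, value in node.items()}
    if isinstance(node, list):
        return {str(index): list_to_dict(item) for index, item in enumerate(node)}
    return node


schema = {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}
converted = list_to_dict(schema)
print(converted)
# {'anyOf': {'0': {'type': 'string'}, '1': {'type': 'null'}}, 'title': 'Accept'}
```

Once lists are dictionaries, the splitter can place individual items in separate chunks, which is why the converted chunk above holds only the item with key `"2"`.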


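You can check the document at a specific index in the `docs` list. Because `docs` was created earlier without `convert_lists=True`, the document at index 2 still contains the original list.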

In [30]:
# Check the document at index 2.
print(docs[2])

page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'
