# JSON

Let's look at how to load files with the `.json` extension using a loader.

- Author: [leebeanbin](https://github.com/leebeanbin)
- Design:
- Peer Review : [syshin0116](https://github.com/syshin0116), [Teddy Lee](https://github.com/teddylee777)
- Proofread: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/tree/main/06-DocumentLoader)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/10-JSON-Loader.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/10-JSON-Loader.ipynb)

## Overview
This tutorial demonstrates how to use LangChain's JSONLoader to load and process JSON files. We'll explore how to extract specific data from structured JSON files using jq-style queries.

### Table of Contents
- [Environment Setup](#environment-setup)
- [JSON](#json)
- [Overview](#overview)
- [Generate JSON Data](#generate-json-data)
- [JSONLoader](#jsonloader)
  


### References
- https://python.langchain.com/docs/how_to/document_loader_json/

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

[Note]
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, useful functions, and utilities for tutorials.
- See [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.


In [2]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [5]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_community",
        "langchain_openai"
    ],
    verbose=False,
    upgrade=True,
)

In [4]:
# JSONLoader requires the jq Python package
%pip install jq

Note: you may need to restart the kernel to use updated packages.


Set the API keys with `set_env` as shown below. Alternatively, put `OPENAI_API_KEY` in a `.env` file and load it with `load_dotenv`.

[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps.

In [None]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "09-JSONLoader",
    }
)

In [6]:
# Load environment variables
# Reload any variables that need to be overwritten from the previous cell

from dotenv import load_dotenv

load_dotenv(override=True)

True

## Generate JSON Data

---

If you want to generate sample JSON data, you can use the following code.


In [7]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pathlib import Path
from dotenv import load_dotenv
from pprint import pprint
import json
import os

# Load .env file
load_dotenv()

# Initialize ChatOpenAI
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.7,
    model_kwargs={"response_format": {"type": "json_object"}}
)

# Create prompt template
prompt = PromptTemplate(
    input_variables=[],
    template="""Generate a JSON object with a single top-level key "people" whose value is an array containing detailed personal information for 5 people. 
        Include various fields like name, age, contact details, address, personal preferences, and any other interesting information you think would be relevant."""
)

# Create and invoke runnable sequence using the new pipe syntax
response = (prompt | llm).invoke({})
generated_data = json.loads(response.content)

# Save to JSON file
current_dir = Path().absolute()
data_dir = current_dir / "data"
data_dir.mkdir(exist_ok=True)

file_path = data_dir / "people.json"
with open(file_path, "w", encoding="utf-8") as f:
    json.dump(generated_data, f, ensure_ascii=False, indent=2)

print("Generated and saved JSON data:")
pprint(generated_data)

Generated and saved JSON data:
{'people': [{'address': {'city': 'Springfield',
                         'state': 'IL',
                         'street': '123 Maple Street',
                         'zipCode': '62701'},
             'age': 29,
             'contactDetails': {'email': 'alice.johnson@example.com',
                                'phone': '+1-555-0123'},
             'interestingFacts': ['Has visited 15 countries',
                                  'Speaks 3 languages fluently',
                                  'Loves to try new recipes every week'],
             'name': 'Alice Johnson',
             'personalPreferences': {'favoriteColor': 'Blue',
                                     'favoriteCuisine': 'Italian',
                                     'hobbies': ['Photography',
                                                 'Traveling',
                                                 'Cooking']}},
            {'address': {'city': 'Metropolis',
                         

To load your own JSON data with the standard library, read the file and parse it as follows.

In [8]:
import json
from pathlib import Path
from pprint import pprint


file_path = "data/people.json"
# Read with the same encoding that was used when the file was written
data = json.loads(Path(file_path).read_text(encoding="utf-8"))

pprint(data)

{'people': [{'address': {'city': 'Springfield',
                         'state': 'IL',
                         'street': '123 Maple Street',
                         'zipCode': '62701'},
             'age': 29,
             'contactDetails': {'email': 'alice.johnson@example.com',
                                'phone': '+1-555-0123'},
             'interestingFacts': ['Has visited 15 countries',
                                  'Speaks 3 languages fluently',
                                  'Loves to try new recipes every week'],
             'name': 'Alice Johnson',
             'personalPreferences': {'favoriteColor': 'Blue',
                                     'favoriteCuisine': 'Italian',
                                     'hobbies': ['Photography',
                                                 'Traveling',
                                                 'Cooking']}},
            {'address': {'city': 'Metropolis',
                         'state': 'NY',
                

In [9]:
print(type(data))

<class 'dict'>
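
Because the loader later addresses this structure with a jq expression, it can help to confirm the shape of the data first. The following is a minimal sketch using only the standard library. The inline `sample` dict is a shortened stand-in for the contents of `data/people.json`, and `describe_top_level` is a hypothetical helper name, not part of LangChain.

```python
# Shortened stand-in for the parsed contents of data/people.json (assumption).
sample = {
    "people": [
        {"name": "Alice Johnson", "age": 29, "address": {"city": "Springfield"}},
        {"name": "Bob Smith", "age": 34, "address": {"city": "Metropolis"}},
    ]
}


def describe_top_level(obj: dict) -> dict:
    """Map each top-level key to its value's type name and, for lists, its length."""
    summary = {}
    for key, value in obj.items():
        entry = {"type": type(value).__name__}
        if isinstance(value, list):
            entry["length"] = len(value)
        summary[key] = entry
    return summary


summary = describe_top_level(sample)
print(summary)
```

A top-level `people` key that holds a list is what makes `.people[]` a natural `jq_schema` in the next section.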


## JSONLoader

---

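`JSONLoader` uses a jq expression, passed as `jq_schema`, to select which parts of a JSON file become documents. For example, to extract the values under the `content` field within a `message` key, or each entry of the `people` array generated above, you can use `JSONLoader` as shown below.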

### Basic Usage

In [10]:
from langchain_community.document_loaders import JSONLoader

# Create JSONLoader
loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[]",  # Access each item in the people array
    text_content=False,
)

# Load documents
docs = loader.load()
pprint(docs)

[Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "age": 29, "contactDetails": {"email": "alice.johnson@example.com", "phone": "+1-555-0123"}, "address": {"street": "123 Maple Street", "city": "Springfield", "state": "IL", "zipCode": "62701"}, "personalPreferences": {"favoriteColor": "Blue", "hobbies": ["Photography", "Traveling", "Cooking"], "favoriteCuisine": "Italian"}, "interestingFacts": ["Has visited 15 countries", "Speaks 3 languages fluently", "Loves to try new recipes every week"]}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 2}, page_content='{"name": "Bob Smith", "age": 34, "contactDetails": {"email": "bob.smith@example.com", "phone": "+1-555-0456"}, "address": {"street": "456 Oak Avenue", "city": "Metropolis", "state": "NY", "zipCode": "10001"}, "personalPreferences

### Loading Each Person as a Separate Document

We can load each person object from `people.json` as an individual document by setting `jq_schema=".people[]"`.

In [7]:
loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[]",
    text_content=False,
)

data = loader.load()
data

[Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "age": 29, "contactDetails": {"email": "alice.johnson@example.com", "phone": "+1-555-0123"}, "address": {"street": "123 Maple Street", "city": "Springfield", "state": "IL", "zipCode": "62701"}, "personalPreferences": {"favoriteColor": "Blue", "hobbies": ["Photography", "Traveling", "Cooking"], "favoriteCuisine": "Italian"}, "interestingFacts": ["Has visited 15 countries", "Speaks 3 languages fluently", "Loves to try new recipes every week"]}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 2}, page_content='{"name": "Bob Smith", "age": 34, "contactDetails": {"email": "bob.smith@example.com", "phone": "+1-555-0456"}, "address": {"street": "456 Oak Avenue", "city": "Metropolis", "state": "NY", "zipCode": "10001"}, "personalPreferences

### Using `content_key` within `jq_schema`

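To use a single field of each record as the document's `page_content`, pass its name as `content_key`. The key is looked up in each result produced by `jq_schema`, so it must exist there. If you want to write `content_key` itself as a jq expression (for example `.name`), also set `is_content_key_jq_parsable=True`.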

In [8]:
loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[]",
    content_key="name",
    text_content=False
)

data = loader.load()
data

[Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 1}, page_content='Alice Johnson'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 2}, page_content='Bob Smith'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 3}, page_content='Charlie Davis'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 4}, page_content='Diana Prince'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 5}, page_content='Ethan Hunt')]
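
Conceptually, `jq_schema=".people[]"` yields one record per person and `content_key="name"` picks the field that becomes `page_content`. Below is a rough plain-Python analogue, assuming a shortened inline sample instead of the real file and ignoring everything else the loader does (such as attaching the `source` metadata). It is an illustration of the idea, not the loader's implementation.

```python
# Shortened stand-in for data/people.json (assumption).
sample = {
    "people": [
        {"name": "Alice Johnson", "age": 29},
        {"name": "Bob Smith", "age": 34},
    ]
}

content_key = "name"

# One (seq_num, page_content) pair per record, numbered from 1 like the output above.
pairs = [
    (seq_num, record[content_key])
    for seq_num, record in enumerate(sample["people"], start=1)
]
print(pairs)
```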

### Extracting Metadata from `people.json`

Let's define a `metadata_func` to extract relevant information like name, age, and city from each person object.


In [9]:
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["name"] = record.get("name")
    metadata["age"] = record.get("age")
    metadata["city"] = record.get("address", {}).get("city")
    return metadata

loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[]",
    content_key="name",
    metadata_func=metadata_func,
    text_content=False
)

data = loader.load()
data

[Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 1, 'name': 'Alice Johnson', 'age': 29, 'city': 'Springfield'}, page_content='Alice Johnson'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 2, 'name': 'Bob Smith', 'age': 34, 'city': 'Metropolis'}, page_content='Bob Smith'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 3, 'name': 'Charlie Davis', 'age': 42, 'city': 'Gotham'}, page_content='Charlie Davis'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 4, 'name': 'Diana Prince', 'age': 26, 'city': 'Themyscira'}, page_content='Diana Prince'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 's
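
Since `metadata_func` is an ordinary function, it can be tried on a single record before it is handed to the loader. Here is a small sketch, assuming hand-written records and a starting metadata dict shaped like the ones in the output above (the `source` value is a placeholder).

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["name"] = record.get("name")
    metadata["age"] = record.get("age")
    metadata["city"] = record.get("address", {}).get("city")
    return metadata


# A complete record and one with no address, to check the fallback path.
full = {"name": "Alice Johnson", "age": 29, "address": {"city": "Springfield"}}
partial = {"name": "No Address", "age": 50}

full_meta = metadata_func(full, {"source": "people.json", "seq_num": 1})
partial_meta = metadata_func(partial, {"source": "people.json", "seq_num": 2})

print(full_meta)
print(partial_meta)
```

The chained `.get("address", {})` keeps a record without an `address` key from raising `KeyError`, and its city simply becomes `None`. A record whose `address` is explicitly `null` would still fail, so data with that shape would need an extra check.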

### Understanding JSON Query Syntax

Let's explore the basic syntax of jq-style queries used in JSONLoader:

1. Basic Selectors
   - `.` : Current object
   - `.key` : Access specific key in object
   - `.[]` : Iterate over array elements

2. Pipe Operator
   - `|` : Pass result of left expression as input to right expression
   
3. Object Construction
   - `{key: value}` : Create new object

Example JSON:
```json
{
  "people": [
    {"name": "Alice", "age": 30, "contactDetails": {"email": "alice@example.com", "phone": "123-456-7890"}},
    {"name": "Bob", "age": 25, "contactDetails": {"email": "bob@example.com", "phone": "098-765-4321"}}
  ]
}
```

**Common Query Patterns**:
- `.people[]` : Access each array element
- `.people[].name` : Get all names
- `.people[] | {name: .name}` : Create new object with name
- `.people[] | {name, email: .contactDetails.email}` : Extract nested data

[Note] 
- Set `text_content=False` when the jq expression yields non-string values (objects, arrays, numbers).
- With the default `text_content=True`, `JSONLoader` expects the extracted content to be a string and raises an error otherwise.
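
To make the common query patterns above concrete, here is what each one selects, written as plain Python over the example JSON. This is only an analogy for reading jq expressions, not how jq or `JSONLoader` evaluates them. For instance, jq returns `null` for a missing key, whereas the indexing below would raise `KeyError`.

```python
import json

example = json.loads("""
{
  "people": [
    {"name": "Alice", "age": 30, "contactDetails": {"email": "alice@example.com", "phone": "123-456-7890"}},
    {"name": "Bob", "age": 25, "contactDetails": {"email": "bob@example.com", "phone": "098-765-4321"}}
  ]
}
""")

# .people[]  -> each element of the array
each_person = list(example["people"])

# .people[].name  -> the name of each element
names = [p["name"] for p in example["people"]]

# .people[] | {name: .name}  -> a new object per element
name_objects = [{"name": p["name"]} for p in example["people"]]

# .people[] | {name, email: .contactDetails.email}  -> nested field lifted to the top level
name_and_email = [
    {"name": p["name"], "email": p["contactDetails"]["email"]}
    for p in example["people"]
]

print(names)
print(name_and_email)
```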

### Advanced Queries

Here are examples of extracting specific information using different jq schemas:

In [22]:
# Extract only contact details
contact_loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[] | {name: .name, contact: .contactDetails}",
    text_content=False
)

docs = contact_loader.load()
docs

[Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "contact": {"email": "alice.johnson@example.com", "phone": "+1-555-0123"}}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 2}, page_content='{"name": "Bob Smith", "contact": {"email": "bob.smith@example.com", "phone": "+1-555-0456"}}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 3}, page_content='{"name": "Charlie Davis", "contact": {"email": "charlie.davis@example.com", "phone": "+1-555-0789"}}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 4}, page_content='{"name": "Diana Prince", "contact": {"email": "diana.prince@example.com", "phone": "+1-555-0912"}}'),
 D

In [21]:
# Extract nested data (hobbies inside personalPreferences)
hobbies_loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[] | {name: .name, hobbies: .personalPreferences.hobbies}",
    text_content=False
)

docs = hobbies_loader.load()
docs

[Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "hobbies": ["Photography", "Traveling", "Cooking"]}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 2}, page_content='{"name": "Bob Smith", "hobbies": ["Cycling", "Reading", "Gaming"]}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 3}, page_content='{"name": "Charlie Davis", "hobbies": ["Hiking", "Fishing", "Woodworking"]}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 4}, page_content='{"name": "Diana Prince", "hobbies": ["Martial Arts", "Reading", "Volunteering"]}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLo

In [23]:
# Get all interesting facts
facts_loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[] | {name: .name, facts: .interestingFacts}",
    text_content=False
)

docs = facts_loader.load()
docs

[Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "facts": ["Has visited 15 countries", "Speaks 3 languages fluently", "Loves to try new recipes every week"]}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 2}, page_content='{"name": "Bob Smith", "facts": ["Completed a marathon last year", "Is a certified yoga instructor", "Has a collection of over 200 video games"]}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 3}, page_content='{"name": "Charlie Davis", "facts": ["Has built his own cabin in the woods", "Loves to go fishing every weekend", "Is a member of a local hiking club"]}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.jso

In [25]:
# Extract email and phone together
contact_info = JSONLoader(
    file_path="data/people.json",
    jq_schema='.people[] | {name: .name, email: .contactDetails.email, phone: .contactDetails.phone}',
    text_content=False
)

docs = contact_info.load()
docs

[Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "email": "alice.johnson@example.com", "phone": "+1-555-0123"}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 2}, page_content='{"name": "Bob Smith", "email": "bob.smith@example.com", "phone": "+1-555-0456"}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 3}, page_content='{"name": "Charlie Davis", "email": "charlie.davis@example.com", "phone": "+1-555-0789"}'),
 Document(metadata={'source': '/Users/leejungbin/Downloads/LangChain-OpenTutorial/06-DocumentLoader/data/people.json', 'seq_num': 4}, page_content='{"name": "Diana Prince", "email": "diana.prince@example.com", "phone": "+1-555-0912"}'),
 D

These examples demonstrate the flexibility of jq queries in fetching data in various ways.
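
In the outputs above, each `page_content` is a JSON string rather than a Python object. Downstream code that needs structured values can parse it back with `json.loads`. Here is a minimal sketch, assuming a `page_content` string copied from the hobbies example rather than a live `Document`.

```python
import json

# page_content of the first document from the hobbies example above.
page_content = '{"name": "Alice Johnson", "hobbies": ["Photography", "Traveling", "Cooking"]}'

# Parse the JSON string back into a dict to access individual fields.
record = json.loads(page_content)
print(record["name"])
print(record["hobbies"])
```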