# Tutorial 11: Working with Structured Data

In this tutorial, we'll explore how to work with structured data in LangChain and LangGraph applications. We'll use Pydantic for data modeling, create structured inputs and outputs, and leverage the JSON Toolkit for complex data manipulation.

## Setup

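First, let's import the necessary libraries and initialize the LLM. `ChatGroq` reads its API key from the `GROQ_API_KEY` environment variable, so set that before running this cell: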

In [37]:
import os
from typing import List, Optional
from pydantic import BaseModel, Field
from langchain_groq import ChatGroq
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
from langchain.output_parsers import PydanticOutputParser
from langchain.tools import BaseTool, StructuredTool, Tool
from langchain.agents import AgentExecutor
from langchain.agents.structured_chat.base import StructuredChatAgent
from langgraph.graph import StateGraph, END

# Initialize Groq LLM
llm = ChatGroq(
    model_name="llama3-70b-8192",
    temperature=0.7,
    model_kwargs={"top_p": 0.8, "seed": 1337},
)

## 1. Introduction to Pydantic for data modeling

Pydantic is a powerful library for data validation and settings management using Python type annotations. Let's start by creating a simple Pydantic model:

In [8]:
class Person(BaseModel):
    name: str
    age: int
    email: Optional[str] = None

# Create a Person instance
alice = Person(name="Alice", age=30, email="alice@example.com")
print(alice)

# Pydantic raises a ValidationError (a subclass of ValueError) when a value
# cannot be converted to the declared type
try:
    invalid_person = Person(name="Bob", age="thirty")
except ValueError as e:
    print(f"Validation error: {e}")

name='Alice' age=30 email='alice@example.com'
Validation error: 1 validation error for Person
age
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='thirty', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/int_parsing
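
The error above is raised because `"thirty"` cannot be converted to an integer. A numeric string such as `"30"` would be accepted, since Pydantic's default mode coerces compatible values. The sketch below illustrates both cases and reads the structured error details through `ValidationError.errors()`. It redefines `Person` only so that it runs on its own, and the `errors` variable is introduced here for illustration.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: str
    age: int
    email: Optional[str] = None


# In the default (lax) mode a numeric string is coerced to int.
bob = Person(name="Bob", age="30")
print(bob.age, type(bob.age).__name__)

# ValidationError.errors() returns one dict per failing field, which is
# easier to act on programmatically than the formatted message.
errors = []
try:
    Person(name="Bob", age="thirty")
except ValidationError as exc:
    errors = exc.errors()

for err in errors:
    print(err["loc"], err["msg"])
```

Catching `ValidationError` directly, rather than the broader `ValueError`, keeps unrelated value errors from being mistaken for schema problems.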


## 2. Creating structured inputs and outputs with Pydantic

Now, let's use Pydantic with LangChain to create structured outputs from our LLM:

In [9]:
class MovieReview(BaseModel):
    title: str = Field(description="The title of the movie")
    year: int = Field(description="The year the movie was released")
    rating: float = Field(description="The rating of the movie on a scale of 0 to 10")
    summary: str = Field(description="A brief summary of the movie's plot")

movie_review_parser = PydanticOutputParser(pydantic_object=MovieReview)

movie_review_prompt = ChatPromptTemplate.from_template(
    """Provide a brief review for the movie {movie_title}. 
    {format_instructions}
    """
)

movie_review_chain = LLMChain(
    llm=llm,
    prompt=movie_review_prompt,
    output_parser=movie_review_parser
)

result = movie_review_chain.run(
    movie_title="The Matrix",
    format_instructions=movie_review_parser.get_format_instructions()
)

print(result)

title='The Matrix' year=1999 rating=9.2 summary='In a dystopian future, humanity is unknowingly trapped within a simulated reality called the Matrix. A computer hacker named Neo is eventually awakened to the truth and must join a group of rebels to free humanity from its enslavement.'
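
Models do not always return bare JSON; the object is often wrapped in a sentence or two. As a rough illustration of the extraction step that has to happen before validation, here is a standard-library sketch. `extract_json_object` and the sample `reply` are hypothetical and written for this tutorial; this is not how `PydanticOutputParser` is implemented.

```python
import json
from typing import Any, Dict


def extract_json_object(text: str) -> Dict[str, Any]:
    """Return the JSON object embedded in an LLM reply (hypothetical helper)."""
    # Take the span from the first "{" to the last "}" and ignore any prose around it.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed


# A made-up reply with prose before and after the JSON payload.
reply = (
    "Sure! Here is the review:\n"
    '{"title": "The Matrix", "year": 1999, "rating": 9.2, '
    '"summary": "A hacker learns the truth about his reality."}\n'
    "Let me know if you need anything else."
)

payload = extract_json_object(reply)
print(payload["title"], payload["year"])
```

The resulting dictionary could then be validated against `MovieReview`, so that missing or mistyped fields surface as a validation error.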


## 3. Using the JSON Toolkit for complex data manipulation

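LangChain provides a JSON Toolkit that lets an agent explore JSON data step by step, listing the keys at a path and fetching individual values. Before using it, let's define a helper that derives a schema-like description from sample data: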

In [19]:
def generate_json_format_instructions(json_data, parent_key=''):
    """
    Generate format instructions for JSON data structure.
    Args:
        json_data: JSON data to analyze
        parent_key: Key of parent object for nested structures
    Returns:
        dict: Format instructions for the JSON structure
    """
    instructions = {}
    
    if isinstance(json_data, dict):
        for key, value in json_data.items():
            current_key = f"{parent_key}.{key}" if parent_key else key
            
            if isinstance(value, dict):
                instructions[key] = {
                    'type': 'object',
                    'properties': generate_json_format_instructions(value, current_key)
                }
            elif isinstance(value, list):
                if value and isinstance(value[0], dict):
                    instructions[key] = {
                        'type': 'array',
                        'items': generate_json_format_instructions(value[0], current_key)
                    }
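                else:
                    # Guard against empty lists, where value[0] would raise IndexError
                    item_type = type(value[0]).__name__ if value else 'unknown'
                    instructions[key] = {
                        'type': 'array',
                        'items': {'type': item_type}
                    }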
            else:
                instructions[key] = {
                    'type': type(value).__name__,
                    'example': str(value)
                }
                
    return instructions



In [38]:
import json
from langchain.tools.json.tool import JsonSpec
from langchain.agents import create_json_agent
from langchain.agents.agent_toolkits import JsonToolkit

# Sample JSON data
json_data = {
    "movie": [
        {"title": "Inception", "director": "Christopher Nolan", "year": 2010},
        {"title": "The Shawshank Redemption", "director": "Frank Darabont", "year": 1994},
        {"title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972}
    ]
}
# Alternative sample with nested objects (commented out; swap it in to experiment)
# json_data = {
#   "movies": {
#     "inception": {
#       "id": "mv001",
#       "title": "Inception",
#       "director": "Christopher Nolan",
#       "year": 2010,
#       "meta": {
#         "lastUpdated": "2024-04-01",
#         "type": "feature_film"
#       }
#     },
#     "shawshank": {
#       "id": "mv002",
#       "title": "The Shawshank Redemption",
#       "director": "Frank Darabont",
#       "year": 1994,
#       "meta": {
#         "lastUpdated": "2024-04-01",
#         "type": "feature_film"
#       }
#     },
#     "godfather": {
#       "id": "mv003",
#       "title": "The Godfather",
#       "director": "Francis Ford Coppola",
#       "year": 1972,
#       "meta": {
#         "lastUpdated": "2024-04-01",
#         "type": "feature_film"
#       }
#     }
#   }}

# Usage example
def get_format_instructions(json_data):
    schema = generate_json_format_instructions(json_data)

    # Each part ends with an explicit newline so the closing sentence
    # does not run into the JSON schema
    instructions = (
        "Your response should be formatted as a JSON object with the following structure:\n"
        f"{json.dumps(schema)}\n"
        "All fields are required unless explicitly marked as optional."
    )

    return instructions

# Example usage with your movie data
format_instructions = get_format_instructions(json_data)
print(format_instructions)

json_spec = JsonSpec(dict_=json_data, max_value_length=4000)
json_toolkit = JsonToolkit(spec=json_spec)

json_agent_executor = create_json_agent(
    llm=llm,
    toolkit=json_toolkit,
    verbose=True,
)

result = json_agent_executor.run("What is the oldest movie in the list?")
print(result)

Your response should be formatted as a JSON object with the following structure:
{"movie": {"type": "array", "items": {"title": {"type": "str", "example": "Inception"}, "director": {"type": "str", "example": "Christopher Nolan"}, "year": {"type": "int", "example": "2010"}}}}
All fields are required unless explicitly marked as optional.


> Entering new AgentExecutor chain...
Action: json_spec_list_keys
Action Input: data
Observation: ['movie']
Thought: I see that there is a key called 'movie'. Let me explore it further.

Action: json_spec_list_keys
Action Input: data["movie"]
Observation: ValueError('Value at path `data["movie"]` is not a dict, get the value directly.')
Thought: It seems that `data["movie"]` is not a dictionary, so I need to get its value directly.

Action: json_spec_get_value
Action Input: data["movie"]
Observation: [{'title': 'Inception', 'director': 'Christopher No
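
The trace shows the agent alternating between two operations: listing the keys at a path and fetching the value at a path. The plain-Python sketch below mimics that navigation on the same sample data, so you can check the agent's answer by hand. `get_value` and `list_keys` are hypothetical helpers written for this illustration; they are not the toolkit's implementation.

```python
from typing import Any, List

json_data = {
    "movie": [
        {"title": "Inception", "director": "Christopher Nolan", "year": 2010},
        {"title": "The Shawshank Redemption", "director": "Frank Darabont", "year": 1994},
        {"title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972},
    ]
}


def get_value(data: Any, path: List[Any]) -> Any:
    """Follow a list of keys or indexes into nested JSON data."""
    for step in path:
        data = data[step]
    return data


def list_keys(data: Any, path: List[Any]) -> List[str]:
    """List the keys at a path, rejecting non-dict values as in the trace above."""
    value = get_value(data, path)
    if not isinstance(value, dict):
        raise ValueError(f"Value at path {path} is not a dict, get the value directly.")
    return list(value.keys())


print(list_keys(json_data, []))  # ['movie']

# "movie" holds a list, so fetch it directly and pick the earliest year.
movies = get_value(json_data, ["movie"])
oldest = min(movies, key=lambda m: m["year"])
print(oldest["title"], oldest["year"])  # The Godfather 1972
```

For data this small a direct lookup is enough. The agent-based approach becomes more useful when the document is too large to fit in a prompt.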

## 4. Integrating structured data with LangChain and LangGraph

Now, let's build a more complex example: a movie recommendation system. It wraps a structured-output chain in a `StructuredTool` and exposes that tool to a LangChain structured-chat agent:

In [8]:
class Movie(BaseModel):
    title: str
    genre: str
    year: int
    director: str

class MovieRecommendation(BaseModel):
    recommended_movie: Movie
    reason: str

class MovieDatabase:
    def __init__(self):
        self.movies = [
            Movie(title="The Shawshank Redemption", genre="Drama", year=1994, director="Frank Darabont"),
            Movie(title="The Godfather", genre="Crime", year=1972, director="Francis Ford Coppola"),
            Movie(title="Pulp Fiction", genre="Crime", year=1994, director="Quentin Tarantino"),
            Movie(title="The Dark Knight", genre="Action", year=2008, director="Christopher Nolan"),
            Movie(title="Forrest Gump", genre="Drama", year=1994, director="Robert Zemeckis")
        ]

    def get_movies(self):
        return [movie.model_dump() for movie in self.movies]

movie_db = MovieDatabase()

def get_movie_recommendation(preferences: str) -> MovieRecommendation:
    movies = movie_db.get_movies()
    prompt = ChatPromptTemplate.from_template(
        """Based on the user's preferences: '{preferences}', recommend a movie from the following list:
        {movies}
        
        Provide your recommendation in the following format:
        {format_instructions}
        """
    )
    
    parser = PydanticOutputParser(pydantic_object=MovieRecommendation)
    
    chain = LLMChain(
        llm=llm,
        prompt=prompt,
        output_parser=parser
    )
    
    result = chain.run(
        preferences=preferences,
        movies=json.dumps(movies, indent=2),
        format_instructions=parser.get_format_instructions()
    )
    
    return result

recommendation_tool = StructuredTool.from_function(
    func=get_movie_recommendation,
    name="MovieRecommendation",
    description="Recommends a movie based on user preferences"
)

tools = [recommendation_tool]

agent = StructuredChatAgent.from_llm_and_tools(llm=llm, tools=tools)
agent_executor = AgentExecutor.from_agent_and_tools(
    agent=agent,
    tools=tools,
    verbose=True
)

result = agent_executor.run("I'm in the mood for a classic crime movie. What do you recommend?")
print(result)


> Entering new AgentExecutor chain...
Thought: I can use the MovieRecommendation tool to suggest a classic crime movie.

Action:
```
{
  "action": "MovieRecommendation",
  "action_input": {
    "preferences": "classic crime movie"
  }
}
```

Observation: recommended_movie=Movie(title='The Godfather', genre='Crime', year=1972, director='Francis Ford Coppola') reason="The Godfather is a classic crime movie, widely regarded as one of the greatest films of all time, and its genre and release year match the user's preferences."
Thought: Action:
```
{
  "action": "Final Answer",
  "action_input": "I recommend 'The Godfather' (1972) directed by Francis Ford Coppola. It's a classic crime movie, widely regarded as one of the greatest films of all time."
}
```

> Finished chain.
I recommend 'The Godfather' (1972) directed by Francis Ford Coppola. It's a classic crime movie, widely regarded as one of the greatest films of all time.
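
Because the recommendation is generated by an LLM, the returned title is not guaranteed to be one of the movies in the list. One possible safeguard, sketched here with standard-library dataclasses, is to look the title up in the catalog and fall back to a deterministic choice when it is missing. `CatalogMovie`, `find_in_catalog`, and `fallback_by_genre` are hypothetical names introduced for this illustration, and the oldest-in-genre fallback rule is an assumption rather than part of the example above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CatalogMovie:
    title: str
    genre: str
    year: int
    director: str


CATALOG = [
    CatalogMovie("The Shawshank Redemption", "Drama", 1994, "Frank Darabont"),
    CatalogMovie("The Godfather", "Crime", 1972, "Francis Ford Coppola"),
    CatalogMovie("Pulp Fiction", "Crime", 1994, "Quentin Tarantino"),
    CatalogMovie("The Dark Knight", "Action", 2008, "Christopher Nolan"),
    CatalogMovie("Forrest Gump", "Drama", 1994, "Robert Zemeckis"),
]


def find_in_catalog(title: str, catalog: List[CatalogMovie]) -> Optional[CatalogMovie]:
    """Return the catalog entry matching a title, ignoring case and outer whitespace."""
    wanted = title.strip().casefold()
    for movie in catalog:
        if movie.title.casefold() == wanted:
            return movie
    return None


def fallback_by_genre(genre: str, catalog: List[CatalogMovie]) -> Optional[CatalogMovie]:
    """Pick the oldest movie of a genre when the model's suggestion is unusable."""
    matches = [m for m in catalog if m.genre.casefold() == genre.casefold()]
    return min(matches, key=lambda m: m.year) if matches else None


# A title from the catalog is found; an invented one triggers the fallback.
print(find_in_catalog(" the godfather ", CATALOG))
print(find_in_catalog("Heat", CATALOG) or fallback_by_genre("crime", CATALOG))
```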

## Conclusion

In this tutorial, we've explored how to work with structured data in LangChain and LangGraph applications. We've covered:

1. Using Pydantic for data modeling and validation
2. Creating structured inputs and outputs with Pydantic and LangChain
3. Using the JSON Toolkit for complex data manipulation
4. Integrating structured data with LangChain and LangGraph in a movie recommendation system

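These techniques validate data at the boundary between your code and the LLM, so malformed model output fails early with a clear error instead of propagating through your application. That makes complex data structures easier to work with in AI-powered systems.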

## Next Steps

To further improve your skills in working with structured data in LangChain and LangGraph applications, consider the following:

1. Experiment with more complex Pydantic models and nested structures
2. Explore other LangChain tools and agents that work with structured data
3. Implement error handling and fallback strategies for parsing failures
4. Integrate external APIs or databases to work with real-world data
5. Develop a full-fledged application that combines multiple structured data techniques